
Hiltzik: Apple examines why artificial intelligence is needed

See if you can solve this arithmetic problem:

On Friday Oliver picks 44 kiwis. Then on Saturday he picks 58 kiwis. On Sunday he picks twice as many kiwis as he did on Friday, but five of them are slightly smaller than average. How many kiwis does Oliver have?

If you answered “190”, congratulations: you did as well as the average elementary school student by getting it right. (Friday’s 44 plus Saturday’s 58 plus Sunday’s 44 times 2, or 88, comes to 190.)
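
The correct reasoning simply ignores the detail about kiwi size. A minimal sketch of the arithmetic (the variable names are illustrative only):

```python
# Arithmetic for the kiwi problem described above.
friday = 44
saturday = 58
sunday = 2 * friday              # "twice as many as on Friday" -> 88

# The remark that five of Sunday's kiwis were smaller than average
# changes nothing: size does not affect the count, so nothing is subtracted.
total = friday + saturday + sunday
print(total)                     # 190
```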

You also performed better than more than 20 state-of-the-art AI models tested by Apple’s AI research team, which found that the bots repeatedly got the answer wrong.

The fact that Apple did this has gotten a lot of attention, but the results shouldn’t surprise anyone.

—AI critic Gary Marcus

The Apple team found a “catastrophic drop in performance” by these models when they were asked to parse simple math problems written in essay form. In this example, the systems often failed to grasp that the size of the kiwis has nothing to do with how many kiwis Oliver has. Some therefore subtracted the five smaller kiwis from the total and answered “185”.

Human students, the researchers say, are much better at telling the difference between relevant information and irrelevant curveballs.

Apple’s findings were published earlier this month in a technical paper that has attracted widespread attention in AI labs and in the press, not only because the results are well documented, but also because the researchers work for a leading consumer technology company that has just rolled out a suite of proposed artificial intelligence features for iPhone users.

“The fact that Apple did this has gotten a lot of attention, but the results shouldn’t surprise anyone,” says Gary Marcus, a critic of how artificial intelligence systems are marketed as reliable, that is, “intelligent.”

Indeed, Apple’s finding is consistent with earlier research that showed that large language models, or LLMs, don’t actually “think” so much as they match language patterns in the materials they’re given as part of their “training.” When it comes to abstract thinking—“a key aspect of human intelligence,” according to Melanie Mitchell, an expert on cognition and intelligence at the Santa Fe Institute—the models fall short.

“Even very young children can learn abstract rules from just a few examples,” Mitchell and her colleagues wrote last year after subjecting GPT bots to a series of analogy puzzles. Their conclusion was that “a large gap in basic abstract reasoning still remains between humans and modern artificial intelligence systems.”

This is important because LLMs like GPT are at the heart of the AI products that have captured public attention. But the LLMs tested by the Apple team were consistently misled by the language patterns they had been trained to match.

Apple researchers set out to answer the question “Do these models really understand mathematical concepts?” as lead author Mehrdad Farajtabar put it in a thread on X. Their answer is no. They also asked whether the shortcomings they identified could be easily fixed, and their answer was also no: “Could scaling up data, models, or computation fundamentally solve this problem?” Farajtabar asked in his thread. “We don’t think so!”

The Apple study, along with other findings about the cognitive limitations of artificial intelligence bots, is a much-needed corrective to the marketing claims coming from companies touting their artificial intelligence models and systems, including OpenAI and Google’s DeepMind lab.

Promoters typically portray their products as reliable and their output as trustworthy. In fact, their results are consistently suspect and pose clear dangers when used in contexts where strict precision is an absolute requirement, such as healthcare applications.

That isn’t always a problem. “There are problems that you can make a lot of money on without having a perfect solution,” Marcus told me. Take recommendation systems powered by artificial intelligence, such as those that steer Amazon shoppers toward products they might also like. If those systems make a bad recommendation, no great harm is done; a customer might spend a few dollars on a book he or she doesn’t like.

“But a calculator that works correctly only 85% of the time is garbage,” Marcus says. “You wouldn’t use it.”

The potential for wildly inaccurate results is heightened by the natural language capabilities of AI bots, which allow them to deliver even absurdly wrong answers with convincing self-assurance. They often double down on their mistakes when challenged.

AI researchers typically describe these errors as “hallucinations.” This term may make errors seem almost harmless, but in some applications even a small error rate can have serious consequences.

That was the conclusion of researchers who recently published an analysis of Whisper, an AI-powered speech-to-text tool developed by OpenAI that can be used to transcribe medical discussions or prison conversations monitored by correctional officers.

The researchers found that about 1.4% of the Whisper-transcribed audio segments in their sample contained hallucinations, including wholly fabricated statements inserted into the transcribed conversation, among them depictions of “physical violence or death… (or) sexual innuendo,” as well as demographic stereotyping.

This may seem like a minor flaw, but the researchers noted that such errors could end up in official records, such as court transcripts or prison phone calls, which could lead to official decisions being made based on “phrases or claims that the defendant never said.”

Updates to Whisper in late 2023 improved its performance, but the updated Whisper “still hallucinated regularly and reproducibly,” the researchers said.

That hasn’t stopped artificial intelligence promoters from boasting unreasonably about their products. In an Oct. 29 tweet, Elon Musk invited followers to send “X-rays, PET scans, MRIs or other medical images” to Grok (the artificial intelligence application for his social media platform X) “for analysis.” Grok, he wrote, “is already quite accurate and will become extremely good.”

It goes without saying that even if Musk is right (hardly a foregone conclusion), any system used by healthcare providers to analyze medical images would have to be much better than “extremely good,” however that standard might be defined.

This brings us to the Apple study. It is pertinent to note that the researchers are not critics of AI per se, but believe in the need to understand its limitations. Farajtabar was previously a senior scientist at DeepMind, where another author interned under him; other co-authors have advanced degrees and professional experience in computer science and machine learning.

The team tested the models on questions drawn from a popular collection of more than 8,000 grade-school arithmetic problems that gauge students’ understanding of addition, subtraction, multiplication and division. When the problems included clauses that might seem relevant but weren’t, the models’ performance dropped sharply.

This was true of all the models tested, including versions of the GPT bots developed by OpenAI, Meta’s Llama, Microsoft’s Phi-3, Google’s Gemma and several models developed by the French lab Mistral AI.

Some did better than others, but all showed declining performance as the problems became more complex. One problem involves a basket of school supplies, including erasers, notebooks and writing paper. Solving it requires multiplying the quantity of each item by its price and adding up the results to determine the cost of the whole basket.

When the bots were also told that “prices were 10% cheaper last year due to inflation,” they reduced the total by 10%. That produces the wrong answer, because the question asks what the basket costs now, not last year.
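
To make the distractor concrete, here is a minimal sketch with made-up quantities and prices; the article gives only the shape of the problem, not the actual figures. The point is simply that the inflation clause should leave the current-price total untouched:

```python
# Hypothetical quantities and current unit prices; the actual figures in the
# test problem are not given in the article.
items = {
    "erasers":       (3, 0.50),   # (quantity, current unit price in dollars)
    "notebooks":     (2, 1.25),
    "writing paper": (1, 2.00),
}

# Correct approach: multiply each quantity by its current price and add them up.
correct_total = sum(qty * price for qty, price in items.values())

# The bots' mistake: applying the irrelevant "10% cheaper last year" clause
# to today's prices, even though the question asks what the basket costs now.
wrong_total = correct_total * 0.90

print(round(correct_total, 2), round(wrong_total, 2))
```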

Why does this happen? The answer is that LLMs are developed, or “trained,” by exposing them to vast amounts of written material drawn from published works or the internet, not by teaching them mathematical principles. LLMs work by gleaning patterns in the data and trying to match a pattern to the question they are asked.

But they “overfit to their training data,” Farajtabar explained via X. “They memorize what’s on the web, match patterns, and respond according to the examples they’ve seen. It is still a (weak) type of reasoning, but by other definitions it is not a genuine reasoning ability.” (Parentheses his.)

This will likely place limits on what AI can be used for. In mission-critical applications, humans will almost always need to be “in the loop,” as AI developers say: checking answers for obvious or dangerous inaccuracies, or keeping the bots from misinterpreting their data, misstating what they know, or filling gaps in their knowledge with fabrications.

In a way, this is reassuring, because it means AI systems can’t accomplish much without human partners at hand. But it also means that we humans need to be alert to the tendency of AI promoters to overstate their products’ capabilities and conceal their limitations. The problem is not so much what AI can do as what users can be fooled into thinking it can do.

“These systems will always make mistakes because hallucinations are inherent in them,” says Marcus. “The way they approach reasoning is an approximation, not a reality. And none of this will go away until we have some new technology.”