AI’s Ostensible Emergent Abilities Are a Mirage
For a few years now, tech leaders have been touting AI’s supposed emergent abilities: the idea that, beyond a certain threshold of complexity, large language models (LLMs) suddenly start doing unpredictable things. If we can harness that capacity, the story goes, AI might help solve some of humanity’s biggest problems. But unpredictability is also scary: Could making a model bigger unleash a completely unpredictable, and potentially malevolent, actor into the world?
That concern is widely shared in the tech industry. Indeed, a recently publicized open letter signed by more than 1,000 tech leaders calls for a six-month pause on giant AI experiments as a way to step back from “the dangerous race to ever-larger unpredictable black-box models with emergent capabilities.”
But according to a new paper, we can perhaps put that particular concern about AI to bed, says lead author Rylan Schaeffer, a second-year graduate student in computer science at Stanford University. “With bigger models, you get better performance,” he says, “but we don’t have evidence to suggest that the whole is greater than the sum of its parts.”
Indeed, as he and his colleagues Brando Miranda, a Stanford PhD student, and Sanmi Koyejo, an assistant professor of computer science, show, the perception of AI’s emergent abilities is based on the metrics that have been used. “The mirage of emergent abilities only exists because of the programmers’ choice of metric,” Schaeffer says. “Once you investigate by changing the metrics, the mirage disappears.”
Finding the Mirage
Schaeffer began wondering if AI’s alleged emergent abilities were real while attending a lecture describing them. “I noticed in the lecture that many claimed emergent abilities seemingly appeared when researchers used certain very specific ways of evaluating those models,” he says.
Specifically, these metrics judge the performance of smaller models more harshly, making it appear that novel and unpredictable abilities arise as the models get bigger. Indeed, graphs of these metrics display a sharp change in performance at a particular model size, which is why emergent properties are sometimes called “sharp left turns.”
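The shape of that curve can be reproduced with a toy calculation, sketched below. The model sizes and per-token accuracies are invented for illustration, and the snippet is not the paper’s code; the only point is that a smooth, gradual gain in per-token accuracy turns into an apparent jump once it is scored with an all-or-nothing metric that requires every token of a multi-token answer to be correct.

# A toy calculation (not the paper's code) showing how a smooth, gradual gain
# in per-token accuracy looks like a sudden "emergent" jump when scored with
# an all-or-nothing metric. All numbers below are invented for illustration.

ANSWER_LENGTH = 10  # tokens the model must get entirely right for an exact match

# Hypothetical models: (parameter count, per-token accuracy).
# Per-token accuracy improves smoothly and modestly with scale.
models = [
    (1e8, 0.50),
    (1e9, 0.70),
    (1e10, 0.85),
    (1e11, 0.95),
    (1e12, 0.99),
]

print(f"{'params':>10} {'per-token acc':>14} {'exact-match acc':>16}")
for n_params, p in models:
    # Expected exact-match accuracy: all ANSWER_LENGTH tokens must be correct.
    exact_match = p ** ANSWER_LENGTH
    print(f"{n_params:>10.0e} {p:>14.2f} {exact_match:>16.4f}")

Under the per-token column the improvement is steady; under the exact-match column the score sits near zero for the smaller models and then shoots up for the largest ones, which is exactly the “sharp left turn” shape.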
But when Schaeffer and his colleagues used other metrics that measured the abilities of smaller and larger models more fairly, the leap attributed to emergent properties was gone. In the paper, published April 28 on the preprint service arXiv, Schaeffer and his colleagues looked at 29 different metrics for evaluating model performance. Twenty-five of them show no emergent properties. Instead, they reveal continuous, linear growth in model abilities as model size grows.
And there are simple explanations for why the other four metrics incorrectly suggest the existence of emergent properties. “They’re all sharp, deforming, non-continuous metrics,” Schaeffer says. “They are very harsh judges.” Indeed, using the metric known as “exact string match,” even a simple math problem will appear to develop emergent abilities at scale, Schaeffer says. For example, imagine doing an addition problem and making an error that’s off by one digit. The exact string match metric will view that mistake as being just as bad as an error that’s off by a billion. The result: a disregard for the ways that small models gradually improve as they scale up, and the appearance that large models make great leaps ahead.
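The contrast is easy to see in code. The sketch below is illustrative rather than the paper’s actual evaluation code, and the per-character score is a stand-in for the kinds of continuous metrics the authors compare against: exact string match gives zero credit to both a near-miss and a wildly wrong answer, while the softer score still registers the near-miss as far better.

# An illustrative sketch (not the paper's evaluation code) of the "harsh judge"
# point: exact string match gives zero credit for a near-miss, while a softer
# per-character score still sees the gradual improvement.

def exact_string_match(prediction: str, target: str) -> float:
    """All-or-nothing: 1.0 only if the two strings are identical."""
    return 1.0 if prediction == target else 0.0

def per_char_accuracy(prediction: str, target: str) -> float:
    """Fraction of character positions that match: a continuous score."""
    length = max(len(prediction), len(target))
    matches = sum(a == b for a, b in zip(prediction, target))
    return matches / length

target = "5000000000"      # the correct sum for some addition problem
near_miss = "5000000001"   # wrong only in the last digit
way_off = "1234567890"     # nearly every digit wrong

for prediction in (near_miss, way_off):
    print(prediction,
          "| exact match:", exact_string_match(prediction, target),
          "| per-char:", round(per_char_accuracy(prediction, target), 2))

Exact string match scores both wrong answers 0.0; the per-character score gives the near-miss 0.9 and the wildly wrong answer 0.1, preserving the signal that the model is getting closer.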
Schaeffer and his colleagues also noticed that no one has claimed that large vision models exhibit emergent properties. As it turns out, vision researchers don’t use the harsh metrics favored by natural language researchers. When Schaeffer applied those harsh metrics to a vision model, voilà, the mirage of emergence appeared.
Artificial General Intelligence Will Be Foreseeable
This is the first time an in-depth analysis has shown that the highly publicized story of LLMs’ emergent abilities springs from the use of harsh metrics. But it’s not the first time anyone has hinted at that possibility. Google’s recent paper “Beyond the Imitation Game” suggested that metrics might be the issue. And after Schaeffer’s paper came out, a research scientist working on LLMs at OpenAI tweeted that the company has made similar observations.
What it means for the future is this: We don’t need to worry about accidentally stumbling onto artificial general intelligence (AGI). Yes, AGI may still have huge consequences for human society, Schaeffer says, “but if it emerges, we should be able to see it coming.”