ChatGPT Can’t Plan. This Matters. – Cal Newport
Last March, Sebastien Bubeck, a computer scientist from Microsoft Research, delivered a talk at MIT titled “Sparks of AGI.” He was reporting on a study in which he and his team ran OpenAI’s impressive new large language model, GPT-4, through a series of rigorous intelligence tests.
“If your perspective is, ‘What I care about is to solve problems, to think abstractly, to comprehend complex ideas, to reason on new elements that arrive at me,’” he said, “then I think you have to call GPT-4 intelligent.”
But as he then elaborated, GPT-4 wasn’t always intelligent. During their testing, Bubeck’s team had given the model a simple math equation: 7*4 + 8*8 = 92. They then asked the model to modify a single number on the left-hand side so that the equation now equaled 106. This is easy for a human to figure out: simply replace the 7*4 with 7*6, since 42 + 64 = 106.
GPT-4 confidently gave the wrong answer. “The arithmetic is shaky,” Bubeck explained.
This wasn’t the only seemingly simple problem that stumped the model. The team later asked it to write a poem that made sense in terms of its content, but also had a last line that was an exact reverse of the first. GPT-4 wrote a poem that started with “I heard his voice across the crowd,” forcing it to end with the nonsensical conclusion: “Crowd the across voice his heard I.”
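To make the constraint concrete, here is a quick sketch (my own illustration, in Python, not part of the study) of what an exact word-for-word reversal of that opening line produces:

```python
# Reverse the word order of the poem's first line, as the challenge demands of the last line.
first_line = "I heard his voice across the crowd"
last_line = " ".join(reversed(first_line.split()))
print(last_line)  # -> crowd the across voice his heard I
```

A writer who plans ahead would instead pick a first line whose reversal still reads as English, then work backward from there.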
Other researchers soon found that the model also struggled with simple block stacking tasks, a puzzle game called Towers of Hanoi, and questions about scheduling shipments.
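Towers of Hanoi is a useful reference point because the planning it requires is easy to write down explicitly. Here is a standard recursive solver in Python (my own sketch, not one of the researchers’ tests); notice that each move only makes sense in light of the moves that must come after it:

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the sequence of moves that transfers n disks from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # first clear the smaller disks out of the way
    moves.append((source, target))               # then move the largest disk directly
    hanoi(n - 1, spare, target, source, moves)   # finally re-stack the smaller disks on top
    return moves

print(hanoi(3, "A", "C", "B"))
# [('A', 'C'), ('A', 'B'), ('C', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('A', 'C')]
```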
What about these problems stumped GPT-4? They all require you to simulate the future. We recognize that the 7*4 term is the right one to modify in the arithmetic task because we implicitly simulate how the sum changes as we add more 7’s: the target of 106 is 14 greater than 92, which is exactly two more 7’s. Similarly, when we solve the poem challenge, we think ahead to writing the last line while working on the first.
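Spelled out as a procedure, that kind of forward simulation is almost trivially simple. The sketch below (my own, not from the article or the study) solves the arithmetic puzzle by trying each possible single-number change and simulating the resulting sum:

```python
# Brute-force look-ahead: try every single-number edit to 7*4 + 8*8
# and simulate the resulting sum to see which change reaches 106.
terms = [[7, 4], [8, 8]]   # the left-hand side of the equation
target = 106

for i in range(len(terms)):
    for j in range(2):
        for new_value in range(1, 20):
            candidate = [term[:] for term in terms]   # copy the equation before simulating the edit
            candidate[i][j] = new_value
            if sum(a * b for a, b in candidate) == target:
                print(f"Replace {terms[i][j]} with {new_value}: "
                      f"{candidate[0][0]}*{candidate[0][1]} + {candidate[1][0]}*{candidate[1][1]} = {target}")
```

The point isn’t that the puzzle is hard; it’s that solving it means looking ahead at the consequences of a change before committing to it.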
As I argue in my latest article for The New Yorker, titled “Can an A.I. Make Plans?,” this inability of language models to simulate the future is important. Humans run these types of simulations all the time as we go through our day.
As I write:
“When holding a serious conversation, we simulate how different replies might shift the mood—just as, when navigating a supermarket checkout, we predict how slowly the various lines will likely progress. Goal-directed behavior more generally almost always requires us to look into the future to test how much various actions might move us closer to our objectives. This holds true whether we’re pondering life’s big decisions, such as whether to move or have kids, or answering the small but insistent queries that propel our workdays forward, such as which to-do-list item to tackle next.”
If we want to build more recognizably human artificial intelligences, they will have to include this ability to prognosticate. (How did HAL 9000 from the movie 2001 know not to open the pod bay doors for Dave? It must have simulated the consequences of the action.)
But as I elaborate in the article, this is not something large language models like GPT-4 will ever be able to do. Their architectures are static and feedforward, incapable of recurrence, iteration, or on-demand exploration of novel possibilities. No matter how large we make these systems, or how intensely we train them, they can’t perform true planning.
Does this mean we’re safe, for now, from creating a real-life HAL 9000? Not necessarily. As I go on to explain, there do exist AI systems, ones that operate quite differently than language models, that can simulate the future. In recent years, there has been an increasing effort to combine these planning programs with the linguistic brilliance of language models.
I give a lot more details about this in my article, but the short summary of my conclusion is that if you’re excited or worried about artificial intelligence, the right thing to care about is not how big we can make a single language model, but instead how smartly we can combine many different types of digital cognition.