The Era of Predictive AI Is Almost Over

“Language models are just glorified autocomplete” has been the critics’ refrain — but reinforcement learning is proving them wrong. New breakthroughs could follow.

Artificial intelligence is a Rorschach test. When OpenAI’s GPT-4 was released in March 2023, Microsoft researchers triumphantly, and prematurely, announced that it possessed “sparks” of artificial general intelligence. Cognitive scientist Gary Marcus, on the other hand, argued that Large Language Models like GPT-4 are nowhere close to the loosely defined concept of AGI. Indeed, Marcus is skeptical of whether these models “understand” anything at all. They “operate over ‘fossilized’ outputs of human language,” he wrote in a 2023 paper, “and seem capable of implementing some automatic computations pertaining to distributional statistics, but are incapable of understanding due to their lack of generative world models.” The “fossils” to which Marcus refers are the models’ training data — these days, something close to all the text on the Internet.

This notion — that LLMs are “just” next-word predictors based on statistical models of text — is so common now as to be almost a trope. It is used, both correctly and incorrectly, to explain the flaws, biases, and other limitations of LLMs. Most importantly, it is used by AI skeptics like Marcus to argue that there will soon be diminishing returns from further LLM development: We will get better and better statistical approximations of existing human knowledge, but we are not likely to see another qualitative leap toward “general intelligence.”

There are two problems with this deflationary view of LLMs. The first is that next-word prediction, at sufficient scale, can lead models to capabilities that no human designed or even necessarily intended — what some call “emergent” capabilities. The second problem is that increasingly — and, ironically, starting with ChatGPT — language models employ techniques that combust the notion of pure next-word prediction of Internet text.

For firms like OpenAI, DeepMind, and Anthropic to achieve their ambitious goals, AI models will need to do more than write prose and code and come up with images. And the companies will have to contend with the fact that human input for training the models is a limited resource. The next step in AI development is as promising as it is daunting: AI building upon AI to solve ever more complex problems and check for its own mistakes.

There will likely be another leap in LLM development, and soon. Whether or not it’s toward “general intelligence” is up for interpretation. But what the leap will look like is already becoming clear.

The Surprising Results of Scale

In 2017, a small AI research nonprofit called OpenAI made an intriguing discovery. As in most AI labs at the time, OpenAI’s researchers spent most of their resources on robotics and teaching computers to master games. But something surprised Alec Radford, a researcher working in the backwater of natural language processing, now more commonly known as “language modeling.”

Radford had trained an AI model to predict the next character of a given input sequence using a database of 82 million Amazon product reviews. In doing so, he discovered that he had unintentionally also built a state-of-the-art sentiment analysis system, something he had never designed it to do. It turned out that to achieve its goal of next-character prediction it was useful for the model to analyze and “understand” the basic emotional valence of the reviews in its training data: knowing that a review was angry, rather than happy, helped the model predict the next character more accurately. Radford had rediscovered a truth that has been the source of almost all major progress in machine learning since the deep learning revolution began a decade ago: unanticipated properties can emerge in systems with simple goals and large scale.

Today’s language models operate in a broadly similar way, except that they predict the next word rather than the next character. (Actually, they predict a sub-word linguistic unit called a “token,” but “word” suffices for our purposes.) The basic theory behind scaling language models further — and spending hundreds of millions, even billions, of dollars to do so — was that, with more data and larger neural networks, models would learn increasingly sophisticated heuristics and patterns that mirror human intelligence.
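The statistical core of next-word prediction can be seen in a toy example. The sketch below builds a predictor from raw word-pair counts over a miniature corpus invented for illustration; real language models replace these counts with a neural network over tokens, but the training objective, guessing what comes next, is the same in spirit.

```python
from collections import Counter, defaultdict

# A toy next-word predictor: count which word follows which in a corpus,
# then predict the most frequent follower. The corpus is invented.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

follower_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follower_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the word that most frequently follows `word` in the corpus."""
    return follower_counts[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on" always follows "sat" here
```

Everything a predictor like this "knows" is a statistical shadow of its training text; the argument of this essay is about what happens when that stops being the whole story.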

Perhaps at a certain scale, the models would even learn how to “model” the process that created their training data, verbal intelligence. In other words, by studying trillions of specific selections of text, the model would learn to approximate intelligent reasoning itself. “What does it mean to predict the next token well enough?” asks then–OpenAI chief scientist Ilya Sutskever in a 2023 interview. “It’s actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token…. In order to understand those statistics … you need to understand: What is it about the world that creates this set of statistics?”

Radford’s 2017 model contained 82 million parameters, a proxy for a model’s size; GPT-4 reportedly contains about 1.8 trillion. Currently, language models can play chess and other board games, speak nearly every language fluently, and achieve elite scores on standardized tests. They even learn a map of the Earth — a literal world model — and store it within their vast web of mathematical relationships. Evidently, scale can deliver quite a bit.

Importantly, though, there are flaws. Sometimes, models simply memorize sequences of text, particularly ones they see repeatedly. Other times, infamously, they make up plausible-sounding “facts” that are false. Counterintuitively, the memorization of frequently encountered text is a case of the models’ failure, while the so-called “hallucinations” are, in a way, a case of their success. Language models are not intended to be a database of the text in their training data, for the same reason that it is neither expected nor desirable for you to memorize every word of a book you read. We do not want the models to memorize the training data — we want them to model it, to map the relationships and patterns within it. In this sense, all non-memorized LLM responses are hallucinations — that is, plausible-sounding responses. Some hallucinations are desirable, while others, particularly false information presented as fact, are undesirable.

Yet even when an LLM presents factual information in sequences of text that it did not memorize, it is still extremely difficult to know whether it truly “understands” the information. The reality that the models routinely output false information suggests, at a minimum, that their models of the world are flawed, or that they are not appropriately grounded.

How to Ground an AI Model in Reality

Earlier this year, researchers at Princeton University’s Plasma Physics Laboratory announced an important next step on the path toward nuclear fusion. Fusion, which replicates the inner workings of a star to generate electricity, has long been dreamed of as a technology that could transform the economics of clean energy. In a tokamak, the kind of reactor design used by the Princeton team, plasma is heated to more than 150 million degrees Fahrenheit and spun around a doughnut-shaped chamber, often at well above 100,000 miles per hour.

As one can imagine, the inside of a tokamak when it is running is a tempestuous place. Yet the plasma must remain under precise control to keep the fusion reaction going. One common problem is that the magnetic field within the reactor can temporarily “tear,” meaning that plasma particles will escape. To help manage this problem, researchers modulate the magnetic field with a real-time control system. These modulations, however, kick in only once tearing is already occurring, reducing the efficiency of the reactor. To make things worse, the environment is subject to nonlinear dynamics: a modulation that worked at one time might cause the fusion reaction to fail at another. What’s more, these problems must be addressed in milliseconds. Optimizing this process is a perpetual challenge in nuclear fusion development.

The Princeton researchers’ work involved training an AI model to perform this optimization in order to avoid tearing altogether. First, they trained a deep neural network to predict plasma pressure and the likelihood of tearing instabilities on the basis of experimental data. Then, using a technique called deep reinforcement learning (RL), they optimized a second model, whose inputs are observed states of the plasma in the reactor, and whose outputs are modulations of the magnetic field to achieve optimal pressure while avoiding tearing. During training, the model’s recommended modulations are graded using the predictions of the first network. The RL-based model has a simple goal: to get the best possible grade.
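The loop just described, observe a state, choose an action, receive a grade, adjust, is the core of all reinforcement learning. The sketch below is a deliberately tiny stand-in: a one-state "bandit" learner choosing among three hypothetical modulation levels, with an invented reward function in place of the plasma-pressure predictor. Nothing here is the Princeton system; it only shows how an agent can optimize a grade with no knowledge of physics.

```python
import random

# Toy reinforcement-learning loop. A hypothetical "reactor" rewards one
# of three magnetic-field modulation levels; the agent sees only a
# scalar reward and learns which action scores best.
random.seed(0)

def reactor_reward(action):
    """Hypothetical grader: a fixed quality score plus observation noise."""
    true_quality = {"low": 0.2, "medium": 1.0, "high": 0.4}[action]
    return true_quality + random.gauss(0, 0.1)

actions = ["low", "medium", "high"]
value_estimate = {a: 0.0 for a in actions}
counts = {a: 0 for a in actions}

for step in range(500):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(actions)
    else:
        action = max(actions, key=value_estimate.get)
    reward = reactor_reward(action)
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(max(actions, key=value_estimate.get))  # should settle on "medium"
```

The real system replaces the lookup table with deep neural networks over high-dimensional plasma states, but the logic of grade-seeking is the same.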

This RL-based model does not “know” physics. It has no equations or theorems of physics explicitly programmed into it. Nonetheless, it can model this staggeringly complex aspect of the real world with better fidelity than earlier approaches, which used computer simulations that are based on formal physics, specifically the fields of magnetohydrodynamics and gyrokinetics. This is the beauty of reinforcement learning: that it enables AI systems to optimize many variables toward a simple goal using real-time data and without explicit knowledge of formal science.

In addition to mitigating plasma instability in fusion reactors, reinforcement learning is at the heart of other recent AI breakthroughs: DeepMind, the AI lab within Google, famously employed RL in the model that achieved superhuman performance at the board game Go.

To what extent can such an optimization system be generalized? What if the same approach could be applied to AI systems that write code, plan and conduct scientific experiments, or craft essays? These are some of the questions at the frontier of language modeling. Reinforcement learning is already challenging, in small ways, the notion of generative AI as just looking at the Internet and predicting the next word. If the current research trends are any clue, they may soon render that notion obsolete.

More Than Next-Word Prediction

Like all technologies that seem like magic from the outside, reinforcement learning is both simpler and more complicated than one might think. It is simpler in the sense that, in the final analysis, it relies on optimizing the value of a single variable: the “reward.” It is more complicated in the sense that choosing what to optimize for, particularly in the context of general-purpose systems like language models, is fiendishly tricky.

The first major foray into the fusion of reinforcement learning and language modeling was the 2022 release of ChatGPT. Ironically, the product that spawned endless claims about how language models just predict the most likely next word from the Internet was actually the first language model that started to break that assumption.

Before ChatGPT, most language models truly were next-word predictors. To prompt those models, one needed to give them a starting sentence and ask them to finish it: “Once upon a time, a brave hero….” These earlier models could be fine-tuned to make them more conversational, but they had a tendency to exhibit toxic behavior or gradually veer off into mirroring the tone of a Reddit commenter rather than a helpful AI assistant. What made ChatGPT a breakthrough consumer technology was a new step in the model’s training process: reinforcement learning from human feedback (RLHF).

RLHF involves collecting human preferences for how a model should respond to prompts — how, in other words, it should behave. Human testers are given two responses to the same prompt and asked to rate which one they prefer. This preference data is used to train a separate neural network called a reward model, which grades the language model’s outputs with a predicted “human satisfaction” score. Finally, the language model’s parameters are adjusted to increase the likelihood of a higher score.
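The middle step, turning pairwise preferences into a scoring function, can be sketched in miniature. The reward model below is a linear scorer over invented feature vectors, trained with the Bradley–Terry-style objective common in the RLHF literature: push the preferred response of each pair above the rejected one. The features and data are hypothetical; a real reward model is a large neural network reading full text.

```python
import math

# Miniature reward-model training. Each response is reduced to an
# invented feature vector (politeness, factuality, verbosity); the
# reward model is a linear scorer trained so that the human-preferred
# response of each pair gets the higher score.
preference_pairs = [
    # (features of preferred response, features of rejected response)
    ((0.9, 0.8, 0.3), (0.2, 0.4, 0.9)),
    ((0.7, 0.9, 0.5), (0.8, 0.1, 0.2)),
    ((0.6, 0.7, 0.2), (0.1, 0.2, 0.8)),
]

weights = [0.0, 0.0, 0.0]

def reward(features):
    """Predicted human-satisfaction score for one response."""
    return sum(w * f for w, f in zip(weights, features))

for _ in range(1000):  # gradient ascent on log P(preferred beats rejected)
    for preferred, rejected in preference_pairs:
        margin = reward(preferred) - reward(rejected)
        p_correct = 1 / (1 + math.exp(-margin))  # sigmoid of the margin
        for i in range(3):
            weights[i] += 0.1 * (1 - p_correct) * (preferred[i] - rejected[i])
```

After training, the scorer ranks every preferred response above its rejected counterpart; in full-scale RLHF, that learned score is what the language model is then adjusted to maximize.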

The process might primarily consist of prompts on a range of innocuous topics but could also include more contentious political and moral issues. With a little of this human preference data — actually a lot, but a little compared to how much data is required to train a useful language model — the model’s conduct can be molded in a variety of subtle and unsubtle ways.

Because reinforcement learning from human feedback will change the language model’s parameters (or “weights,” as they are sometimes called), a model that has gone through RLHF is no longer predicting words based purely on statistical analysis of the Internet. The magnitude of the adjustments to the weights is usually small, but as the use of RLHF and other reinforcement learning methods increases, the gap between the statistical map of the Internet and the final language model will increase.

RLHF was essential to making ChatGPT a friendly, helpful, and knowledgeable assistant. But it also comes with tradeoffs. Collecting large amounts of human preference data is prohibitively expensive for all but the largest players in the field. Even for those with the resources to obtain it, it is not clear that human preference data unambiguously makes models better. The base model of GPT-4, for example, scored 90 percent on the AP Microeconomics exam, while the RLHF version scored 77 percent, though across a wide-ranging suite of performance benchmarks the two versions performed about the same.

There are other drawbacks to the RLHF approach. It can make models more sycophantic, meaning they invent facts they assess the human might like to hear. RLHF can also make models more verbose, because human reviewers seem to prefer longer answers to more concise ones that contain the same information. RLHF can cause models to be mealy-mouthed, refusing to take positions, or inappropriately dodging questions using all-too-common phrases such as “as an AI language model, I cannot….” Google’s Gemini model caused a minor scandal with its refusal to answer questions such as whether the conservative activist Christopher Rufo has hurt society more than Adolf Hitler. (Gemini’s habit of producing racially skewed images, for instance depicting Nazis as black in the interest of diversity, was almost certainly not related to RLHF — it was because Google built its model to emphasize diversity, seemingly by tweaking user prompts behind the scenes automatically.) Meta’s Llama model refused to write code to “kill” a computer process — a term of art, in this context — because, the model said, killing is wrong.

In a technical sense, problems of this kind stem from what is called “overoptimization,” which is when a reward model overshoots the target of modeling human preferences. But there is a deeper problem: To what extent are human preferences useful for training models that are, in some sense, smarter than the average human? If our objective is to use AI systems to advance the boundaries of human knowledge, how much should human preferences factor into the model’s output? Did quantum mechanics comport with human “preferences” about the nature of reality? To what extent, in other words, do human preferences constitute the truth about the world?

The Coming AI Ouroboros

If we are to use language models to push back the frontiers of human knowledge, it seems likely that something beyond human preferences is required. The obvious candidate is AI models themselves. This approach is known by a variety of names, the most general of which is reinforcement learning from AI feedback (RLAIF). The concept is also sometimes referred to as “scalable oversight.” It is undoubtedly cheaper to use AI than humans for feedback, but some have suggested that it might also be better.

Among the most intriguing applications of RLAIF is the “Constitutional AI” approach from the company Anthropic. Constitutional AI involves embedding human preferences in a set of written principles, the constitution; beyond that single document — to simplify matters a bit — no human preference data is required. Instead, the base model is used to generate responses to prompts, which it is then directed to critique and revise in the context of a randomly chosen principle from the constitution. (If you squint, this is a bit like how constitutional law in the United States works.) The revised answers are then used to train the model further. Finally, the model goes through RLAIF, the AI feedback process — very much like RLHF, except with another AI model picking the best output based on its preferences, rather than a human.
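The shape of that critique-and-revise loop can be sketched as follows. The `query_model` function here is a placeholder stub, not Anthropic’s actual API or method; the point is only the structure: generate a draft, critique it against a randomly drawn principle, revise, and keep the result as training data with no human in the loop.

```python
import random

# Structural sketch of the Constitutional AI self-critique loop. The
# constitution and model stub are hypothetical; a real system would call
# the base language model where query_model appears.
constitution = [
    "Choose the response that is most helpful and honest.",
    "Choose the response least likely to encourage harm.",
]

def query_model(prompt):
    # Placeholder standing in for a call to the base model.
    return f"<model output for: {prompt!r}>"

def constitutional_revision(user_prompt):
    draft = query_model(user_prompt)
    principle = random.choice(constitution)
    critique = query_model(
        f"Critique this response against the principle '{principle}': {draft}"
    )
    revised = query_model(
        f"Rewrite the response to address this critique: {critique}"
    )
    return user_prompt, revised  # a supervised training pair, no human needed

training_data = [constitutional_revision(p) for p in ["Explain fusion.", "Write a poem."]]
```

The output of each pass is simply more training data, which is what makes the approach scale without armies of human raters.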

It may sound like an ouroboros, but the results are impressive: The most powerful version of Anthropic’s latest model, Claude 3 Opus, performs better on quantitative benchmarks, such as math and reasoning tests, than any other model. Opus is also a qualitative leap: In March, it became the first model to unseat GPT-4 from the top spot on the LMSYS Chatbot Arena, a popular leaderboard for language models — though an upgraded version of GPT-4 has since reclaimed first place.

Perhaps most intriguingly, Opus has shown remarkable (and, to some, troubling) signs of metacognition and situational awareness. For example, during Anthropic’s routine performance testing, the model recognized the contrived nature of one of its tasks and pointed out in its response that it suspected it was being tested. The model will readily speak with its users about its assessment of the exact nature and extent of these metacognitive characteristics.

A possible explanation for this behavior is that Anthropic seems to treat its models a bit differently than other developers do theirs. Most language models have system prompts written by their developers to give them basic instructions. Almost always, they begin with language such as “You are ChatGPT, a helpful AI assistant.” Anthropic’s system prompt for Claude 3, however, begins simply with, “The assistant is Claude, created by Anthropic.” This raises the question of to whom, exactly, this system prompt is being addressed. To the model? Should the model be considered a distinct entity from Claude, the assistant persona? “The assistant is Claude” may be the most philosophically rich statement in the recent history of artificial intelligence.

Or might the startling new capability of metacognition be at least partially explained by Constitutional AI, by the millions (at the very least) of words the model exchanged with — in essence — itself? Could this have led to an emergent capability of the model to model itself, and hence its own cognitive processes? In a recent article, Anthropic explained that it had used Constitutional AI to train Claude’s “character”: “we can teach Claude to internalize its character traits without the need for human interaction or feedback.”

Many other reinforcement learning–based approaches to improve language model reasoning are in the works. OpenAI, for example, has proposed using a methodology called “process supervision” to improve performance on mathematics — perhaps the biggest weak spot of current-generation language models. This involves giving mathematical reasoning tasks to a model and telling it to show each step of its reasoning. Human labelers then grade each step of the reasoning. These grades are used to train a reward model, which is then used to enhance the original language model. The resulting version of the model performed meaningfully better on mathematical reasoning tasks than the preceding version, which was focused on rewarding the correct answer rather than the correct reasoning process. The next step in development is to use AI techniques in the process supervision, rather than relying on humans — an innovation that DeepMind has recently proposed.
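The difference between rewarding outcomes and rewarding process can be made concrete with a toy example. Below, a hypothetical worked solution is graded two ways: outcome supervision sees only whether the final answer is right, while process supervision credits each verified step, so the location of the error becomes part of the training signal. The arithmetic steps and their validity flags are invented for illustration.

```python
# Toy contrast between outcome supervision and process supervision.
# Each step of an invented solution carries a checker's verdict.
solution_steps = [
    ("4 * 6 = 24", True),    # each step is checked independently...
    ("24 + 10 = 34", True),
    ("34 / 2 = 18", False),  # ...so the faulty step is pinpointed here
    ("answer: 18", False),   # outcome supervision only ever sees this line
]

def outcome_reward(steps):
    """Reward the final answer only: 1 if correct, else 0."""
    return 1.0 if steps[-1][1] else 0.0

def process_reward(steps):
    """Reward the fraction of verified steps, giving partial credit and
    localizing the error."""
    return sum(1.0 for _, step_is_valid in steps if step_is_valid) / len(steps)

print(outcome_reward(solution_steps), process_reward(solution_steps))  # 0.0 0.5
```

Grading the process rather than the verdict gives the reward model far more signal per problem, which is plausibly why it helped most on multi-step mathematics.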

Expectation Management

Current language models are still making next-word predictions based on their statistical representations of the Internet. But as the approaches outlined here play an increasing role in the development of language models, this description will become increasingly unhelpful, and eventually it may fall apart altogether. If approaches like Constitutional AI are widely adopted, it may become more appropriate to think of future language models as the product of several AIs reasoning together and conversing among themselves, with the entire written corpus of human knowledge — our tweets and blogs, our poetry and prose, our wisdom and folly — as the foundation.

We do not know where this path will take us. But it is fair to assume that the coming years and decades may be among the most technologically transformative in recent history. And since AI is likely to underpin that transformation, informed citizens would be wise to watch its development closely, with both vigilance and wonder. To do so, we must be willing to revise our assumptions about what AI is and how it works as the field changes.

It may be comforting to some to think of language models as mere representations of the Internet, as they still largely are today. But the next step in AI development combusts that notion, and those not paying close attention may be in for a surprise as great as the initial release of ChatGPT.
