Where will LLMs take us?

Not a week seems to pass by without some surprising news concerning large language models (LLMs). Most recently it was news of an LLM trained for other purposes playing chess at a reasonable level. This seemingly constant stream of surprising news has led to talk that LLMs are the next general-purpose technology—a technology that affects an entire economy—and will usher in a new era of rapid productivity growth. They might even accelerate global economic growth by an order of magnitude, as the Industrial Revolution did, providing us with a Fifth Industrial Revolution.

Given all this enthusiasm, it’s not surprising that there’s a rush to develop LLM-powered autonomous agents, mechanical servants to prosecute tasks on our behalf. Developers are enhancing LLMs with vector databases and knowledge graphs to provide them with memory and domain-specific expertise. LLMs are being used to decompose goals into tasks and plan actions.

We’ve seen an explosion of creativity in the short year since ChatGPT was first released. This wealth of activity leads some pundits to predict that artificial general intelligence (AGI), which is best read as human-like intelligence in machines, is only a few years away, perhaps as early as 2030. Others are concerned that LLM-powered agents will become the paperclip maximisers that Bostrom and others fear, arguing that we should contain these technologies—regulate them, limit access, control development, and ensure alignment—if they’re to be a force for good (rather than evil).

LLMs are powerful, but they are also limited. An LLM playing chess will make illegal or even simply naive moves. LLMs decomposing tasks are easily confused and become stuck in loops. And so on. It’s easy to be surprised when LLMs fail in this way, likely because we were also surprised by just how capable they were in the first instance and so over-estimated their abilities. Throwing more technology at the LLMs—more weights and so more compute power, more sophisticated prompts, larger and better-tuned training sets, integrating external knowledge sources—reduces the problems somewhat but does not eliminate them.

It’s important to remember that LLMs are creatures of language. Given some text—a prompt—they can predict a following statement. This means that any problem we can reframe as a language problem is amenable to LLM language prediction. LLMs can transform bullet points or an outline into an essay, or describe how a task might be decomposed into subtasks. They can predict the next move in a game of chess described in Algebraic Notation. We might outline the Towers of Hanoi problem and ask an LLM to translate the description into Planning Domain Definition Language (predict what a PDDL description of the problem would look like), which we then feed to a planning engine for the final solution. Or two 3D models can be integrated to create a single object by using an LLM to predict how a unified model might be described. The ‘Hello World’ of LLMs seems to be the PDF summariser—taking a document encoded in PDF’s page-description format (a descendant of PostScript), extracting the natural language text, and then summarising the result. Indeed, there are many technical languages that an LLM can manipulate and translate between. Programming is a popular example, where we take a description of a problem (which could be textual or a wireframe) and have the LLM translate this into code. This is a thread that we haven’t pulled on hard enough yet.
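To make that reframing concrete, here is a minimal sketch of the Towers of Hanoi example, where the LLM is only asked to predict what a PDDL encoding of the puzzle might look like and a conventional planner does the actual solving. The ask_llm() function is a hypothetical stand-in for whichever LLM API happens to be available, not a real library call.

```python
# A minimal sketch of reframing a planning problem as a language problem.
# ask_llm() is a hypothetical placeholder for an LLM API call.

def ask_llm(prompt: str) -> str:
    """Hypothetical call to an LLM; returns the model's text prediction."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

problem = """
Three pegs (A, B, C) and three discs (small, medium, large).
All discs start on peg A, largest at the bottom.
Move every disc to peg C, one at a time,
never placing a larger disc on a smaller one.
"""

# Ask the LLM to predict what a PDDL encoding of this puzzle would look like.
# The predicted text, not the LLM, is what the planning engine actually solves.
pddl_problem = ask_llm(
    "Translate the following puzzle into a PDDL problem definition "
    "for a standard Towers of Hanoi domain:\n" + problem
)

print(pddl_problem)  # hand this off to a planning engine for the final solution
```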

On the other hand, the fact that LLMs are creatures of language is also the source of their limitations.

Language is incredibly powerful—it can even seem to have magical abilities. A book enables us to enter somebody’s head and know what they think. Language can be used to embody knowledge in descriptions and instructions that others can learn from. It’s commonly assumed that what we hear in our head is what we think—I think (have an internal monologue), therefore I am—that language is the root of consciousness. (Though you might be surprised to learn that not everyone has an internal monologue.) It’s this assumption that language is all of consciousness that leads to the belief that—given a sufficiently powerful LLM—we will have created a general AI.

Language is a powerful tool, but it is not particularly precise. There are synonyms and homonyms, both of which create confusion. We have high- and low-context languages, where what is unsaid can be as (if not more) important than what is said (relying on shared experience to convey meaning). Most importantly, while language is an important part of what makes us human, it does not capture all that being human entails.

There are places that language cannot go, ideas and feelings that we cannot use it to express. Lacan calls this place The Real—experience which cannot be captured by symbols and language. This idea of things beyond language pops up in all sorts of annoying places. You can see it rear its head when Herbie Hancock and Jacob Collier discuss musical harmony, as they soon run out of words to describe experience and resort to exclamations woven between chords stabbed on the keyboards they’re sitting at. The Real is experience that eludes our attempts to represent it, to reify it in language.

We’re constantly skating along the edge of language as we find our way through the world. Consider something seemingly as simple as defining ‘seat’. Dictionaries (and LLMs, when asked) provide expansive definitions as they try to account for the wide range of things that we might consider a seat. A better (and more concise) approach is to simply define ‘seat’ as a thing that affords sitting, something we can sit on. This thing could be a chair, a flat rock, or a conveniently shaped tree root—the thing only becomes a seat (and so part of language) when we realise that we can sit on it. We define ‘seat’ by referring directly to shared experience.

An LLM-powered agent will stumble when it encounters something undescribed or indescribable. An ill-defined term, an ambiguous task, or some language confusion can easily confound its language prediction. Awkward naming in the two 3D models we’ve asked an LLM to integrate can result in random (and physically impossible) interconnections. An LLM might teleport a threatened chess piece to safety. And so on.

LLMs will struggle if we cause them to go beyond what is explicitly stated in their prompt, when they run out of grounded references to predict from. Humans can directly refer to experience and so find their way through. LLMs, on the other hand, must be provided with suitable external references if they are to self-correct their language-driven reasoning. An LLM trained on “A is to B”, for example, fails to predict “B is to A” if the connection between the two is not explicitly stated in the prompt.
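As a rough illustration of what that grounding looks like in practice, the sketch below restates the forward relation in the prompt so the reversal can be predicted from the text in front of the model rather than recalled from its weights. The names are illustrative placeholders, and ask_llm() is the same hypothetical helper as in the earlier sketch.

```python
# A rough sketch of grounding a reversed relation in the prompt itself.
# ask_llm() is the hypothetical LLM helper from the earlier sketch;
# "Alice" and "Carol" are illustrative placeholders, not claims about any model.

# Likely to stumble: the model must reverse a relation ("B is to A") that it
# only ever saw in the forward direction ("A is to B") during training.
ungrounded = ask_llm("Who is Carol's child?")

# More reliable: restate the forward relation explicitly, so the answer can be
# predicted from the prompt rather than recalled from the weights.
grounded = ask_llm(
    "Fact: Alice's mother is Carol.\n"
    "Question: Who is Carol's child?"
)
```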

Similarly, we might prompt an LLM to remember something under the assumption that training has caused it to record that fact in its trillions of weights. This is not the case, though. Rather than being recalled, a memory is created when a prompt interacts with the LLM’s language prediction processes. The memory is in the prompt (the words) as much as it is in the weights, much as a smell (our prompt) can evoke a memory (a creation) of a long-forgotten moment.

We might even view memory as a process rather than as a thing, making ‘memory’ more a verb than a noun. An LLM doesn’t recall a famous work of art—it constructs a new representation based on how the prompt interacts with the patterns encoded in its trillions of weights, patterns that are shadows of the many representations of, and conversations about, the work of art that were used to train the LLM. This means that LLM memory is fallible—much like human memory—and that LLMs are prone to false memories and other types of confabulation when predicting without a suitable external reference, as memories (facts) are fabricated as needed.

Much of the excitement (and fear) around LLMs is due to confusing language prediction with reasoning and rationality—we approach LLMs as synthetic humans rather than powerful language manipulation tools.

We can, for example, have a natural conversation with an LLM-powered chatbot. (Unlike conventional state-machine-based chatbots, which are restricted to awkward and obviously artificial conversations.) The human might even convince themselves that the LLM chatbot is conscious, self-aware. However, we cannot ignore how the chatbot is a passive and aimless conversationalist, following the human through the twists and turns of the human’s interests rather than seeking its own goals. It only provides opinions when asked, for example, not when it desires to. The chatbot is reacting to the human, predicting the most likely response, not guiding, nudging, or even driving the conversation where the chatbot (or its makers) want the conversation to go. It lacks agency—it is no more than a mirror held up to the human conversationalist. Efforts to make LLM-powered chatbots goal-seeking via prompt engineering (which boils down to telling the chatbot that it should be goal-seeking) haven’t worked. Nor does integrating external knowledge sources (via RAG or knowledge graphs) solve this problem (though it does help ground language prediction, and so reduce false memories and confabulations). Language is an important part of intelligence, but we must be careful that we don’t confuse language (and language prediction) with agency and consciousness.
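A rough sketch of the kind of grounding that retrieval provides helps show why it reduces confabulation without creating agency: the retrieved text simply becomes part of the prompt. The sketch below is a minimal retrieval-augmented generation (RAG) loop, with naive word overlap standing in for a real vector database and ask_llm() again a hypothetical helper.

```python
# A minimal sketch of retrieval-augmented generation (RAG). A real system would
# use embeddings and a vector database; naive word overlap stands in for
# retrieval here, and ask_llm() is a hypothetical placeholder for an LLM call.

def ask_llm(prompt: str) -> str:
    """Hypothetical call to an LLM; returns the model's text prediction."""
    raise NotImplementedError

documents = [
    "The 1972 World Chess Championship was played in Reykjavik, Iceland.",
    "Algebraic notation records chess moves using piece letters and squares.",
    "PDDL is a language for describing planning domains and problems.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many query words they share, keeping the top k."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(words & set(doc.lower().split())))
    return ranked[:k]

def answer(question: str) -> str:
    # The retrieved text grounds the prediction, reducing confabulation;
    # it does nothing, however, to make the chatbot goal-seeking.
    context = "\n".join(retrieve(question, documents))
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```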

On the other hand, if we treat LLMs as language manipulation tools then they can be part of all manner of powerful solutions. It shouldn’t have been surprising that an LLM could play chess—after all, there’s a long history of chess analysis and notation and a vibrant online chess community, all of which is likely within LLM training sets. Nor should it be surprising if an LLM is shown to juggle—there’s a vibrant online juggling community and established juggling notation to support training an LLM, and the LLM should be capable of manipulating juggling notation and translating it into the G-code used to drive a robot juggler.

Many of the LLM-powered solutions we’ll use in the mid to long term are yet to be imagined—we’ve only scratched the surface of what might be possible. The LLM, however, is likely to be only one part of these solutions. A chess-playing LLM needs to be grounded by connecting it to the rules of chess—via inclusion in its prompt, or by verifying any moves it suggests—to prevent illegal and naive moves. Something needs to prevent our juggling LLM from specifying impossible, or simply dangerous, moves. If we use an LLM to predict how two physical components could be integrated (by predicting the unified 3D CAD description), then the result needs to be validated before it is used. An LLM is also likely to be an important component in future chatbots, but it will need to be complemented by other components that enable the chatbot to be goal-directed. And so on. LLMs are clearly powerful tools that can be applied to all manner of problems we’ve yet to imagine. They might even usher in a Fifth Industrial Revolution, though only time will tell if that’s the case. However, the AGI that many hope for will continue to be harder to achieve than we think.
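To give a flavour of what that grounding might look like for the chess example, here is a minimal sketch that verifies every move the LLM suggests against the actual rules before accepting it. It uses the python-chess library for move validation; ask_llm() remains a hypothetical stand-in for an LLM API, and the prompt and retry logic are illustrative rather than definitive.

```python
# A minimal sketch of grounding a chess-playing LLM: every suggested move is
# checked against the rules (via the python-chess library) before it is played.
# ask_llm() is a hypothetical placeholder for an LLM API call.

import chess

def ask_llm(prompt: str) -> str:
    """Hypothetical call to an LLM; returns the model's text prediction."""
    raise NotImplementedError

def next_move(board: chess.Board, retries: int = 3) -> chess.Move:
    """Ask the LLM for a move in algebraic notation, rejecting illegal ones."""
    for _ in range(retries):
        suggestion = ask_llm(
            f"The current chess position, as FEN: {board.fen()}\n"
            "Reply with your next move in standard algebraic notation."
        ).strip()
        try:
            return board.parse_san(suggestion)  # raises ValueError if illegal
        except ValueError:
            continue  # the LLM predicted an illegal (or garbled) move; ask again
    raise RuntimeError("no legal move suggested; fall back to a chess engine")

board = chess.Board()
board.push(next_move(board))  # only verified, legal moves reach the board
```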