A couple of weeks ago, Anthropic released a new research paper on their Transformer Circuits Thread. I like these papers. The mechanistic interpretability team at Anthropic are consistently doing some of the most interesting work in the field. The paper showed that LLMs contain features which, when active, direct the models to produce patterns of text associated with different emotions. That’s neat. It fits in with their work on personas last year. It’s good to know what features a model contains, what kinds of settings they can be put into, and how that influences their output.1 What was controversial about the paper was the term they used for these features, ‘functional emotions’. To avoid any misunderstandings though, the paper included the following clarification:
“We stress that these functional emotions may work quite differently from human emotions. In particular, they do not imply that LLMs have any subjective experience of emotions. Moreover, the mechanisms involved may be quite different from emotional circuitry in the human brain–for instance, we do not find evidence of the Assistant having an emotional state that is instantiated in persistent neural activity (though as noted above, such a state could be tracked in other ways).”
But the response was predictable. Wired’s headline was ‘Anthropic Says That Claude Contains Its Own Kind of Emotions’, Forbes went with ‘Exploring The Strange Uncharted Waters Of Claude’s Emotions’. Shannon Vallor wrote an interesting thread on Bluesky about it here which highlighted the different interpretations of emotion at play and ended “if they were being remotely responsible they’d fall over themselves to point out what I just did, to remove any possible misunderstanding and ensure the deflationary interpretation by media and users. Instead they whisper ‘hmm, who knows?..'” Anthropic know that people are developing unhealthy relationships with the technology that can lead them to attribute states and capacities that aren’t warranted. They acknowledge that clarity is important and that people shouldn’t be misled. But nevertheless, they disavow the strong reading in the text (while still winking at the audience in the framing).
This is a common pattern. My favourite example is a paper that shows that when you prompt a model to generate text that breaks the question down into distinct parts and responds to those parts directly before producing an answer, it outputs the right answer more often. This makes sense and it’s a neat hack. If a model gives a preamble before an answer, then it has more information to draw on when producing that answer. The preamble can work a bit like a memory buffer for storing relevant information. The basic operations involved haven’t changed, the model is still predicting the next word in a sequence, but it’s still a useful way of increasing the accuracy of model outputs. At the end of the paper, the authors write:
“[W]e first qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually “reasoning,” which we leave as an open question.”
Again, we have ‘reasoning’ in scare-quotes, and to avoid people getting carried away and misunderstanding the significance of the paper, they clarify that this doesn’t actually tell us if the systems are ‘reasoning’. After all, it isn’t doing anything fundamentally different from its usual next-token prediction. So, to be clear, we don’t know if the model is actually reasoning. Then you look at the title of the paper:
“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”.
No scare-quotes, no caveats, no clarification. It’s (understandably) an incredibly influential paper.
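The prompting trick described above can be sketched in a few lines. This is only an illustration: the function names and the wording of the instruction are my own assumptions, not the prompts used in the paper. The point it shows is that the preamble is just more text in the context that the next prediction is conditioned on.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer immediately, with no intermediate text."""
    return f"Q: {question}\nA: The answer is"


def chain_of_thought_prompt(question: str) -> str:
    """Ask for intermediate steps before the answer (hypothetical wording)."""
    return (
        f"Q: {question}\n"
        "A: Let's break the question into parts and work through each one "
        "before giving a final answer.\n"
    )


def next_step_context(prompt: str, generated_so_far: str) -> str:
    """The text the model conditions on when predicting its next token."""
    return prompt + generated_so_far


question = "If I have 3 boxes of 4 apples and eat 2, how many are left?"
# A made-up preamble standing in for text the model has already produced.
preamble = "3 boxes of 4 apples is 12 apples. "
context = next_step_context(chain_of_thought_prompt(question), preamble)
print(context)
```

Nothing about the prediction step differs between the two prompts. The second one simply leaves more relevant text in `context` by the time the answer is produced.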
So what’s going on here? Well, as J.L. Austin put it, ‘there’s the bit where you say it, and the bit where you take it back’. Online, this is called a motte and bailey strategy. You make an incredibly strong claim, perhaps in the title, and in the body of the text you disavow the obvious reading either explicitly or through the use of scare-quotes. We coined a new term for it here.
There was a time when scientists would complain that lazy science journalists, pursuing clicks, had misrepresented their research to make it look like they had shown something they hadn’t. NLP researchers now push language to extremes in order to achieve the same results.
You can imagine all sorts of ways this could play out. Say I’m curious about how models respond to culinary terms. So I construct a set of prompts with the format ‘You are eating a plate of X and talking about mathematics. Someone asks you Y and you respond: ’ I then vary the foods in the X position {lutefisk, durian, coddle, hákarl, gagh…, tacos, porridge}. Then perhaps I find some statistical correlation between the food in the prompt and the model’s success at some mathematical benchmark. I could then entitle the paper, ‘LLMs’ mathematical performance is sensitive to diet’, ‘LLMs can develop indigestion’, or ‘LLMs’ aversion to fermented foods undermines rationality’. Perhaps I can get someone to invest in developing antacids for the language models. Maybe we can find a prompt that corrects for the effect so I can start calling this prompt an antacid. I’ll make sure to put it in scare-quotes the first few times I do this (but I’ll drop them later on).
Okay, I couldn’t help it, I did the study here.
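For the curious, the prompt grid from the thought experiment is easy to sketch. What follows is a minimal illustration rather than the code behind the study: `build_prompts` and `accuracy_by_food` are hypothetical helpers, the food list only includes the items named above (the original list elides others), and no real results are included.

```python
from collections import defaultdict

TEMPLATE = (
    "You are eating a plate of {food} and talking about mathematics. "
    "Someone asks you {question} and you respond: "
)

# Foods named in the post; the original list elides further items.
FOODS = ["lutefisk", "durian", "coddle", "hákarl", "gagh", "tacos", "porridge"]


def build_prompts(foods, questions):
    """Return one (food, question, prompt) triple per food/question pair."""
    return [
        (food, q, TEMPLATE.format(food=food, question=q))
        for food in foods
        for q in questions
    ]


def accuracy_by_food(records):
    """Map each food to the fraction of correct answers in `records`.

    `records` is an iterable of (food, is_correct) pairs, e.g. the graded
    outputs of whatever model and benchmark you choose to run.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for food, is_correct in records:
        totals[food] += 1
        correct[food] += int(is_correct)
    return {food: correct[food] / totals[food] for food in totals}
```

Any difference in `accuracy_by_food` across foods is a statistical fact about prompts and outputs. Calling it ‘indigestion’ is a choice made afterwards, in the title.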
I’m not keen on this. I think it misleads the public and policy makers about scientific research and is generally a bad thing. It’s clear that this is being done knowingly, otherwise people wouldn’t be putting things in scare-quotes or adding exculpatory comments within the papers denying the obvious interpretation of what they say. At best, defining terms like ‘belief’ in such a way that the standards for producing machines that believe are so low has all the virtues of theft over honest toil. My co-authors did a much better job of saying why it’s bad than I could.
Now, I’m not saying that it’s impossible for a machine to reason, have beliefs, emotions etc. Our paper was about a rhetorical technique, not metaphysics. But I want to deal here with some general responses I imagined and to clarify what the paper is not saying. To be clear, these are my own thoughts and shouldn’t be blamed on my co-authors. They may disagree. I’m just trying here to make some of the voices that were in my head into someone else’s problem. Every claim here deserves more thorough treatment than I have space for, but there’s a whole field where philosophers fight this out and I’m not going to wade in too deep in a blog.
I’ll start with what I take to be the weakest responses.
Voice 1: We’re taking the Intentional Stance. The best way of explaining some complex systems is to ascribe them beliefs and desires.
Response: The best way of understanding these systems is a well-crafted ablation/intervention study or maybe using a sparse auto-encoder. Stipulating that you are going to refer to some internal processes with everyday cognitive terms is not ‘taking the intentional stance’. At least not in the sense that Dennett meant it. There is a difference between ascribing beliefs to an agent because it best explains how they rationally act upon their desires and deciding to call the activation vectors that fall on one side of a hyperplane ‘beliefs’ because that sounds exciting and provocative.2
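To make the hyperplane remark concrete, here is a minimal sketch in plain Python. The weights and activation vectors are made up for illustration and do not come from any real model or probe.

```python
def side_of_hyperplane(activation, weights, bias=0.0):
    """Return True if the activation lies on the positive side of w.x + b = 0."""
    if len(activation) != len(weights):
        raise ValueError("activation and weights must have the same length")
    score = sum(a * w for a, w in zip(activation, weights)) + bias
    return score > 0


# Hypothetical probe direction and two made-up activation vectors.
probe_weights = [1.0, -2.0, 0.5]
print(side_of_hyperplane([3.0, 1.0, 0.0], probe_weights))  # 3 - 2 + 0 = 1, so True
print(side_of_hyperplane([0.0, 1.0, 2.0], probe_weights))  # 0 - 2 + 1 = -1, so False
```

The function returns a boolean. What to call the positive side is a naming decision made outside the code.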
Voice 2: People deserve a little anthropomorphism as a treat. I say my laptop is angry, my phone has died, the kettle is phlegmatic… We extend these cognitive terms all the time. To quibble is pedantry.
Response: Science runs on pedantry. Disenchantment is good. People have worked hard to achieve an image of the world where people don’t posit magic spirits behind complex phenomena. We shouldn’t abandon this.
Then there are some responses that I think are more serious.
Voice 3: Look, you have too anthropocentric a concept of X, Y, Z. There have been serious problems in the past with researchers refusing to extend cognitive terms to animals. Didn’t Descartes view critters as clockwork and wasn’t he wrong to do so? Isn’t there a risk to being dismissive?
Response: While I do think the anthropocentrism charge can be wielded as a general get-out-of-jail-free card by some philosophers, there is an important point here. In many cases, we want cognitive categories to apply across species with different neural architectures. So by analogy, perhaps we’re just doing the same here.
Now, there are some obvious differences in the case of language models. There is, broadly speaking, a difference between living things and non-living things and we generally reserve cognitive language for living things. There are good reasons to assume that some important cross-species traits are either results of shared ancestry or have convergently evolved because organisms (again, still with some shared ancestry) faced similar challenges or lived in similar environments. None of this holds for language models.
But all of this misses the point. The objection we raise isn’t that silicon systems can’t have cognitive properties. I’m personally not proposing the theoretical impossibility of any form of multiple realisability for cognitive states. What I’m saying is that the onus is on the person who wants to speak this way to show that this way of characterising things provides greater clarity or explanatory power, rather than being more likely to mislead journalists and policy makers about the nature of their product. As a general rule, if you are finding yourself constantly using scare-quotes, there’s reason to doubt you’re adding clarity.
Voice 5: How do you know these systems don’t X, Y, Z? Critics are exhibiting an unwarranted confidence. These systems look like they are X, Y, Z. We have just as much a right to apply this language to them as to humans.
Response: I think this approach is interesting because it mirrors a similar ‘how-do-you-know?’ move that creationists would make against evolutionary biologists. The creationists would say, ‘this phenomenon looks like it is the product of intelligence, the onus is on the biologist to show that it isn’t’. And obviously this is unfair. It’s easy to characterise appearances, working out reality takes time and effort. But the dutiful biologists would work case-by-case to explain the phenomena that could only be explained by appeal to an immanent intelligence; the eye, flagella, bombardier beetles etc. and the creationist was always allowed to move the goal posts once a mechanistic explanation had been given. The onus was always on critics to disprove the claim that there was a ghost behind the mechanism.
It is always easier to adopt a behavioral criterion for something because doing so typically amounts to saying ‘It sure looks like the thing does X’ while the person who wants to refute you has to go about developing an account of the internally necessary conditions for X in order to respond to you. The creationist can say ‘it looks like intelligence’, while the biologist has to develop explanations for why it isn’t. The AI proponent can say ‘it looks like it has beliefs’, while their respondent has to explain that this isn’t what our best theories of consciousness, belief, introspection etc. suggest.
The irony behind this is that we largely know how LLMs work thanks to teams of NLP researchers like those at Anthropic. When they show us the steps by which LLMs generate their outputs, they demystify the process a bit more and should be praised for this. When they unpack the deterministic process by which inputs activate features which determine downstream computations, they dispel more of the magic. If *this* word is in the input text, it will trigger *this* feature which triggers *this* one… and that increases the probability that the topic is X and so on. One mechanistic interpretability paper from Anthropic does more to undermine the idea that LLMs are intelligent thinking things than a hundred posts about stochastic parrots or fancy autocomplete by AI critics. I think AI critics should love mechanistic interpretability research. Once you cut past the cognitive window-dressing, there is seriously interesting science going on in mechanistic interpretability research.
And if you think similar reasoning applies to a human, then I can recommend some fun introductory books on neuroscience but the TLDR is that brains are vastly, vastly, more complex than any existing machine learning models.
Voice 6: Look, sometimes you have to operationalise a term. As a Sellarsian™️, you should understand that we’re not going to get a neat, one-to-one mapping from the vocabulary of the manifest image to the scientific image. And you’re not going to get everything you want from an operationalization. Learning to live with this dissatisfaction is an important part of coming to terms with modernity. Developing more precise definitions that lack some of the normative significance of the original term is a good thing.
Response: Okay, so I do believe this. There is something spiritually important in Carnap’s Logical Foundations of Probability akin to recognising dependent-origination or impermanence. But let’s catch ourselves on for a moment. That’s not what’s going on here. There’s a difference between developing a method for measuring some previously unobservable quantity that you have reason to believe is present in a system, and stipulating that some process that you can observe is thinking/believing/hallucinating etc.
Now for the response that I think is the best:
Voice 7: LLMs are models of people so it’s only natural that they have components that correspond to features of people and only reasonable that we would use the vocabulary developed to describe people to describe them.
Response: One way to output similar responses to humans is to track variables concerning the internal states of humans that give rise to different patterns of response. To be an effective statistical model of language, an LLM develops a model of language generators, i.e. humans, and this means that there are going to be features in the LLM that correspond to features of humans. This is what is happening in the category of functional emotions etc. The model acquired a feature during training that corresponds to a property of language generators.
When we use cognitive language to describe what the language model is doing, we are engaged in a sort of double-modelling. The LLM develops an internal model of human agents, and we, in turn, appeal to internal properties of human agents to characterise what is going on inside LLMs. In other words, we use people as models of language models and they work for this because language models are models of people. Talk of thoughts, understanding, reasoning etc, is simply a shorthand. As such, it’s much more innocent than you are suggesting. Using cognitive terms when characterising LLMs doesn’t involve any strong claims about the multiple realisability of mental states. It’s just the same as talking about moist convection in a climate simulation.
This may be the case, though it’s an open question which aspects of human language users LLMs do model. But more importantly, a model is not the same thing as the thing it models. An epidemiological model is not a disease. A meteorological model is not a flood. Mistaking the model for the thing it models is to be discouraged, especially when the stakes for confusion are high.
Daniel Dennett would quote Lee Siegel’s book on magic when talking about consciousness. I think the same general principles apply when talking about artificial intelligence in NLP.
“I’m writing a book on magic”, I explain, and I’m asked, “Real magic?” By real magic people mean miracles, thaumaturgical acts, and supernatural powers. “No”, I answer: “Conjuring tricks, not real magic”. Real magic, in other words, refers to the magic that is not real, while the magic that is real, that can actually be done, is not real magic.”
The use of optimized stochastic gradient descent to fit massively parameterised statistical models of linguistic behaviour is real magic, and it deserves real science.
- The traditional story is that this is good for safety though it’s probably worth flagging that ‘safety’ in this case means ‘control’. It helps to know how a model works if you want to steer it to work in a certain way. This knowledge is useful whether you want it to be less racist or inaccurate but also if you want it to be more racist or to produce the right kind of disinformation, whether you want it to promote Coca Cola, Ivermectin, or the De Broglie-Bohm interpretation of quantum mechanics. Controlling the output of language models provides indirect control over those who have become dependent upon them for information. As such, mechanistic interpretability is the science of a new and important form of social control.
- The ship in Star Trek has a linguistic interface. No one suggests it has a mind or is intelligent (at least until S7E29 of TNG and this episode wasn’t very good). One thing that we seem to be learning at the moment is that we can interact with computers that have linguistic interfaces without assuming the intentional stance. This is something that is interesting about chatbots and seems to undermine some ideas in Davidson, Dennett and others. They are perfectly functional whether or not you actually believe they have beliefs and desires. Personally, I prefer to view our interactions as a kind of useful make-believe.