Psychology and neuroscience applied to LLMs

Reading: How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language models, Nature news feature, 14 May 2024. Available as an audio long read at YouTube.

I was drawn to this news feature as it seems the opposite of ideas I’ve been reading about lately. I’ve been interested in how concepts from engineering help us better understand biology (Can a biologist fix a radio? and Reverse engineering of biological complexity), and here we’re looking at ideas from psychology and neuroscience to understand technology.

The whole article is a discussion of explainable AI (XAI) and the need to understand how LLMs work. No real conclusions, but some fun references.

On psychology

“The human mind is a black box, animal minds are kind of a black box and LLMs are black boxes,” says Thilo Hagendorff, a computer scientist at the University of Stuttgart in Germany. “Psychology is well equipped to investigate black boxes.”

This references the “machine psychology” paper, suggesting we treat LLMs as we do psychology subjects in order to assess their behaviour. This is tricky, as LLMs have consumed a lot of psychology research, which they can then mimic. On the other hand, that’s useful for simulating people, as covered in a great Science piece on synthetic subjects.

On neuroscience

Another angle is using imaging concepts from neuroscience (paper):

The researchers told their LLM several times to lie or to tell the truth and measured the differences in patterns of neuronal activity, creating a mathematical representation of truthfulness […]

With the added benefit of being able to make edits:

The researchers went further and intervened in the model’s behaviour, adding these truthfulness patterns to its activations when asking it a question, enhancing its honesty. They followed these steps for several other concepts, too: they could make the model more or less power-seeking, happy, harmless, gender-biased and so on.
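The intervention described above is often called activation steering. A toy sketch of the idea, with random vectors standing in for real LLM hidden states (the dimension, data and `steer` helper here are all illustrative assumptions, not the paper’s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimension

# Stand-ins for activations recorded while the model was told to be
# truthful vs. told to lie (in the paper these come from the LLM itself).
truthful_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d))
lying_acts = rng.normal(loc=-0.5, scale=1.0, size=(100, d))

# The "truthfulness" representation: difference of the mean activations.
truth_direction = truthful_acts.mean(axis=0) - lying_acts.mean(axis=0)

def steer(activation: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled concept direction to a hidden state (activation steering)."""
    return activation + strength * direction

# Intervening: push a fresh hidden state towards "truthful".
h = rng.normal(size=d)
h_steered = steer(h, truth_direction, strength=2.0)

# The steered state projects more strongly onto the truthfulness direction.
print(h @ truth_direction < h_steered @ truth_direction)  # True
```

The same recipe, with a different pair of prompt sets, gives the power-seeking, happiness or bias directions mentioned in the quote.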

Looks like a powerful technique. Anthropic’s work on transformer circuits is related, discovering “monosemantic” feature patterns in LLMs. Looking at individual neurons doesn’t work, they say:

Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. (From Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, 2023)
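Anthropic’s fix is dictionary learning via a sparse autoencoder: re-express each polysemantic activation vector as a sparse combination of many more learned features. A minimal, untrained sketch of that architecture (sizes, weights and the negative encoder bias are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_features = 8, 32  # real SAEs use far more features than neurons

# Randomly initialised sparse-autoencoder weights (untrained, for illustration).
W_enc = rng.normal(size=(d_model, d_features)) / np.sqrt(d_model)
b_enc = np.full(d_features, -0.5)  # negative bias encourages sparsity
W_dec = rng.normal(size=(d_features, d_model)) / np.sqrt(d_features)

def encode(x: np.ndarray) -> np.ndarray:
    """Map a polysemantic activation vector to (mostly zero) feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps only a few features active

def decode(f: np.ndarray) -> np.ndarray:
    """Reconstruct the original activation from the feature activations."""
    return f @ W_dec

x = rng.normal(size=d_model)       # one hidden-state vector
features = encode(x)
x_hat = decode(features)

# Only a subset of the 32 candidate features fire for this input.
print(f"{(features > 0).sum()} of {d_features} features active")
```

Training minimises reconstruction error plus a sparsity penalty on `features`; after training, each feature tends to fire for one interpretable concept, which is what makes it monosemantic.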

In Anthropic’s 2024 update (published since the Nature news feature):

Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images) […]

Importantly:

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. 

Being able to detect when an LLM is being “dishonest” by analysing the network’s activations? Sounds incredibly useful, and still neurosciencey in that you’re probing a network rather than engineering a system. Still: “This research is also very preliminary”.