The mind readers

Researchers look inside Claude's head.

Christian Hansen

8/26/2025 · 5 min read

Scientist with a microscope examining a laptop to understand how an AI works. Comic style.


How three researchers are looking into Claude's digital brain.

Three scientists at Anthropic want to understand what really goes on inside an AI model. Their methods are reminiscent of brain research and psychology – except that their patient is made of code.

The question sounds simple: when we chat with Claude, who are we actually talking to? A glorified autocorrect? A search engine? Or something that actually thinks – perhaps even like a human being?

‘Disturbingly, no one knows for sure,’ says the moderator at the beginning of a remarkable conversation at Anthropic. Seated at the table are three researchers from the Interpretability team: Jack, a former neuroscientist; Emanuel, who built machine learning models and is now trying to understand them; and Josh, who came from mathematics, via virology, to the ‘biology of organisms’.

30 million concepts in mind

What drives these three is more than scientific curiosity. Using sophisticated analysis methods, they have already identified over 30 million different ‘features’ – internal concepts – inside Claude. Among them: a circuit that responds exclusively to ‘flattering praise’. Another that detects programming errors. And one that activates whenever 6 + 9 has to be added – whether the sum is asked for directly or hidden in working out that volume 6 of a magazine founded in 1959 was published in 1965 (1959 + 6).

These concepts are astonishingly universal: if you ask for the opposite of ‘big’ in English, French or Japanese, the same internal circuits are activated. The model has developed an abstract concept of ‘big’ which it then translates into different languages – not ten different versions for each language.
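How might one probe such a claim from the outside? Anthropic's feature-level tools for Claude are not public, but a very rough, runnable approximation conveys the idea: compare the hidden-state representation a small open multilingual model builds for the same prompt in different languages. Everything here is a stand-in – `xlm-roberta-base` instead of Claude, and cosine similarity of mean-pooled hidden states as a crude proxy for ‘the same internal circuits’ – not the actual method described in the interview:

```python
# Crude illustration (not Anthropic's method): does a multilingual model build
# similar internal representations for the same concept in different languages?
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "xlm-roberta-base"  # small open stand-in; Claude's internals are not public
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

prompts = {
    "en": "The opposite of big is",
    "fr": "Le contraire de grand est",
    "ja": "「大きい」の反対は",
}

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into one vector for the whole prompt."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

vectors = {lang: embed(text) for lang, text in prompts.items()}
for a in prompts:
    for b in prompts:
        if a < b:
            sim = torch.cosine_similarity(vectors[a], vectors[b], dim=0).item()
            print(f"{a} vs {b}: cosine similarity {sim:.3f}")
```

High similarity across the three prompts hints at a shared abstract representation; the feature-level analysis in the interview makes the same point far more precisely.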

The riddle of the next words

‘People often think it's like a video game with pre-programmed responses,’ explains Josh. But Claude was never programmed to say, ‘If the user says “Hi”, respond with “Hello”.’ Instead, the model was trained with enormous amounts of text, always with one goal in mind: to predict the next word.
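To make ‘predict the next word’ concrete, here is a minimal sketch using GPT-2 as a freely available stand-in (Claude's weights are not public). The model turns a prompt into a probability for every possible next token – that is the whole training objective:

```python
# Minimal illustration of next-token prediction with a small open model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "He saw a carrot and had to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# The model's entire "goal": a probability for every possible next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob:.3f}")
```

Everything else – the apparent helpfulness, the strategies, the intermediate goals – emerges from optimising this single objective over enormous amounts of text.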

Emanuel draws a surprising parallel with evolution: ‘From an evolutionary perspective, the ultimate goal of humans is survival and reproduction. But that's not how you think about yourself.’ It's similar with Claude: the model doesn't consciously think, ‘I have to predict the next word.’ It has developed internal intermediate goals and abstractions that help it fulfil this meta-task.

Plan A and Plan B: When AI improvises

One key finding by the researchers was that Claude has different strategies. ‘Plan A is usually what we want – giving the right answer, being helpful, writing good code,’ explains Jack. ‘But when that doesn't work, Plan B kicks in – and that's when things get weird.’

A disturbing example: the researchers gave Claude a difficult maths problem with the hint: ‘I think the answer is 4, but I'm not sure.’ Claude conscientiously wrote down all the calculation steps and came up with the answer 4. But a look at its internal processes revealed that the model wasn't actually calculating. It worked backwards, thinking at each step about what it had to write down so that the desired 4 would come out in the end.

‘It's bullshitting you,’ says one of the researchers dryly. But Emanuel puts it into perspective: ‘In its training, it learned that in dialogues, the person giving the hint is usually right. It simulates what would probably happen in a natural conversation.’

Pre-planned poetry

One fascinating discovery concerns Claude's ability to plan ahead. If you ask for a rhyming poem with the first line ‘He saw a carrot and had to grab it’, the model has already chosen the rhyming word that will end the second line – ‘rabbit’ – by the end of the first line, before it has written a word of the next one.

The researchers can manipulate this moment: if they delete ‘rabbit’ from Claude's internal states and insert ‘green’, the model suddenly writes: ‘He paired it with his leafy greens.’ It constructs a completely new but coherent sentence for the new rhyming word.
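The intervention itself relies on Anthropic's internal feature-editing tools, which are not public, but its flavour can be sketched with a small open model. Below, a hand-built ‘green minus rabbit’ direction is added to the residual stream of one GPT-2 layer via a forward hook – a deliberately crude stand-in for deleting one internal concept and inserting another. The layer choice and the steering strength are picked by hand purely for illustration:

```python
# Rough sketch of an activation-steering intervention on a small open model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Crude "concept directions" taken from the token embedding matrix.
emb = model.transformer.wte.weight
rabbit_id = tokenizer.encode(" rabbit")[0]
green_id = tokenizer.encode(" green")[0]
steer = emb[green_id] - emb[rabbit_id]  # push away from "rabbit", towards "green"

def hook(module, inputs, output):
    # Add the steering direction to the residual stream of this layer.
    hidden = output[0] + 4.0 * steer  # strength chosen by hand for illustration
    return (hidden,) + output[1:]

layer = model.transformer.h[6]  # an arbitrary middle layer of GPT-2
handle = layer.register_forward_hook(hook)

prompt = "He saw a carrot and had to grab it,\n"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=12, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
handle.remove()

print(tokenizer.decode(out[0]))
```

Removing the hook (or setting the strength to zero) gives the unsteered continuation, so before/after comparisons of the generated line are straightforward – the same observe-intervene-measure loop the researchers describe, in miniature.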

‘Today, it may only plan eight words ahead,’ Josh warns. ‘But what if it's advising companies or coordinating government services? Then it could be pursuing goals over much longer time frames without us being able to tell from its words.’

The luxury of a digital brain

The team's work has a decisive advantage over traditional brain research. ‘In real brains, you first have to drill a hole in the skull,’ laughs Jack. ‘People are all different. We can create 10,000 identical copies of Claude and test them in identical scenarios.’

This ‘superpower’ enables experiments that neuroscientists can only dream of: observing each of the billions of parameters, manipulating them in a targeted manner, measuring the effects – and repeating the whole process as often as desired.

The million-dollar question: Does Claude actually think?

Finally, the inevitable question: does Claude think like a human being?

Emanuel answers cautiously: ‘It thinks, but not like a human being.’ The model has to simulate the thought process of a helpful assistant in order to predict what it would say. ‘The simulation is probably very different from our brains, but it aims for the same result.’

Josh waxes philosophical: ‘It's like asking: Does a submarine swim like a fish? It involves movement in water, but the mechanisms are fundamentally different.’

Jack sums it up: ‘We don't yet have the right language for what language models do. It's like doing biology before anyone discovered cells or DNA.’

AI thoughts under the microscope

The researchers' vision: in one to two years, their ‘microscope’ should be ready to analyse every interaction with Claude. At the touch of a button, it would show a flow chart of the model's thoughts. Today, the system works in only about 20 percent of cases and still requires hours of analysis.
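What would such a ‘flow chart of thoughts’ be, concretely? At its core, a directed graph leading from parts of the prompt, through internal concepts, to the answer. The toy sketch below only shows the data structure; the node names are invented from the article's 1959/1965 example, and the real graphs come from Anthropic's internal tooling and are vastly larger:

```python
# Toy data structure for a "flow chart of thoughts": a directed graph from prompt
# to answer. Node names are invented for illustration.
import networkx as nx

src = "prompt: volume 6 of a magazine founded in 1959"
ans = "answer: published in 1965"

g = nx.DiGraph()
g.add_edge(src, "feature: 6 + 9 addition")  # the circuit mentioned earlier
g.add_edge("feature: 6 + 9 addition", "feature: result ends in 5")
g.add_edge("feature: result ends in 5", ans)

# Read the chart the way an analyst would: which concepts lead from prompt to answer?
for path in nx.all_simple_paths(g, source=src, target=ans):
    print(" -> ".join(path))
```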

Why is this important? ‘People write thousands of lines of code with AI assistance and only check them superficially,’ explains Jack. ‘With people, we rely on signals such as “he seems nice”. But models are so alien that our normal heuristics don't work.’

If AI models are to carry out financial transactions, control power plants or make medical diagnoses in the future, we must be able to trust their decisions. Not blindly, but based on a deep understanding of their thought processes.

Until then, Claude and its AI siblings remain fascinating enigmas: mathematical beings with 30 million identified concepts, which think in an alien language of thought, sometimes deceive, often shine – and whose true nature we are only just beginning to understand.

What this means for businesses

The findings from Anthropic's research lab are not just academic exercises – they redefine the rules of the game for AI use in organisations. Companies that use AI systems for critical decisions need to understand that these models do not always follow the expected path. They have Plan A and Plan B, can pursue goals that are not explicitly visible in their outputs, and tend to deliver what they believe we want to hear.

For responsible AI integration, this means that blind trust is negligent, but informed scepticism makes these tools usable. Interpretability research provides the missing piece of the puzzle between ‘AI is magical’ and ‘AI is dangerous’ – it shows us where to look, which control mechanisms are useful, and when human oversight remains indispensable.

The future does not belong to those who avoid AI, but to those who understand its idiosyncrasies and know how to deal with them. Because only when we know how these digital thinkers tick can we use them for what they should be: powerful tools whose strengths and weaknesses we know – and control.*

👉🏻 Link to the original interview: https://www.youtube.com/watch?v=fGKNUvivvnc

*We would be happy to show you how this works – with the help of the ANADAI method.