Reading group · NLA paper · Anthropic 2026

What the natural language autoencoder is really claiming

The math in the methods section encodes several strong philosophical claims. Here they are stated plainly.

What it actually does, without the equations

A language model processes text token by token. At each token, at each layer, it holds an internal state — a vector of thousands of numbers representing everything the model has inferred and is carrying forward. The NLA intercepts this state and asks: can it be expressed in words?
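To make the interception step concrete, here is a minimal sketch in PyTorch. A toy stack of linear layers stands in for the transformer; the hook mechanism is PyTorch's standard one, but every name and shape here is illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer (illustrative, not the paper's code):
# a stack of layers whose per-token output plays the role of the
# residual stream.
d_model, n_layers, seq_len = 64, 4, 8
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(n_layers)])

captured = {}
def capture_hook(module, inputs, output):
    captured["acts"] = output.detach()    # (seq_len, d_model) after this layer

L = 2                                     # a middle layer
handle = model[L].register_forward_hook(capture_hook)
model(torch.randn(seq_len, d_model))      # run the model over (toy) embeddings
handle.remove()

t = 5                                     # token position of interest
activation = captured["acts"][t]          # the "thought" the NLA operates on
```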

Consider a model reading the sentence "He saw a carrot and had to grab it." At the word "it," the model doesn't just know what "it" refers to — it is already, at that moment, forming expectations about what comes next. The residual stream activation at that token is a snapshot of that forward-looking state. It contains the model's current "thought."

The NLA operates on that snapshot. Two components are trained together:

Activation (raw numbers from layer l) → Verbalizer (AV) writes a description → Reconstructor (AR) rebuilds the numbers from the words

The only training signal is reconstruction quality: how close does the rebuilt activation come to the original? Nothing in the training objective says the description has to be in English, or be coherent, or mean anything to a human. And yet — it is, and it does. That emergence is the thing worth examining.
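Stated as a formula (notation mine, reconstructed from the description above rather than copied from the paper), the entire training signal is:

```latex
% a      : the activation at one token, one layer
% AV(a)  : the verbalizer's text description of a
% AR(.)  : the activation the reconstructor rebuilds from that text
\mathcal{L}_{\mathrm{recon}} = \left\| a - \mathrm{AR}\big(\mathrm{AV}(a)\big) \right\|_2^2
```

No term rewards readability; anything that survives the round trip counts.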

What "layer" means

A transformer model is a stack of processing layers. Early layers handle surface syntax. Middle and late layers carry higher-level semantic and pragmatic content — intentions, hypotheses, inferences. The NLA is applied to middle-to-late layers, where the rich stuff lives. You are not reading phonology; you are reading something closer to thought.

Five philosophical claims hidden in the methods section

The authors never state these outright. They appear as mathematical choices. But each choice carries a philosophical commitment.

Claim 1 · Representational completeness
"Natural language is a sufficient medium for model cognition."

The training objective asks: can a text description reconstruct an activation with high fidelity? If yes, the paper has shown that whatever information a model holds in mind at a given moment can, in principle, be stated in words. Language is not merely approximate — it is representationally adequate for this kind of cognition.

This is an empirical claim, not a theoretical one. The FVE score is the measurement. At 0.8, 80% of variance is captured. The remaining 20% is either noise — or something language cannot hold.

Claim 2 · Functionalism
"Cognitive states just are their functional descriptions."

By treating reconstruction fidelity as the ground truth for explanation quality, the paper implicitly adopts a functionalist position: the activation is exhausted by what it does, and what it does can be captured by what it is about. There is no remainder, no "inner light" behind the function.

This is the same bet that functionalism in philosophy of mind makes: mental states are defined by their causal roles, not their substrate. If that bet is wrong — if there is something it is like to be the model in a way that the activation vector doesn't capture — then the NLA can only ever be a partial account.

Claim 3 · The interpretability accident
"Interpretability is not a design goal — it is an attractor."

Nothing in the loss function rewards human readability. The AV is free to produce any string that helps the AR reconstruct the activation — it could develop a private code, compress into symbols, output noise that the AR learns to invert. Instead it produces fluent, coherent, semantically rich prose.

The reason is initialization: the AV begins as a copy of the target language model. It has a strong prior toward the kinds of text language models produce. Combined with a KL penalty that keeps it close to that prior, the path of least resistance turns out to be natural language. Interpretability is not forced — it is the natural equilibrium of a language model describing its own internals.
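Putting the two pressures together (again in my own notation; the weight β and the exact form of the penalty are assumptions, not quoted from the paper):

```latex
% First term: the reconstruction loss from above.
% Second term: the KL penalty keeping the AV's output distribution
% near the base language model (LM) it was initialized from.
\mathcal{L} = \left\| a - \mathrm{AR}\big(\mathrm{AV}(a)\big) \right\|_2^2
  + \beta \, D_{\mathrm{KL}}\!\big(\mathrm{AV} \,\big\|\, \mathrm{LM}\big)
```

The first term pulls toward informativeness, the second toward fluent language; the equilibrium between them is where the readable descriptions come from.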

That is philosophically interesting: it suggests that language is not just one possible encoding of model cognition, but the optimal one for a system that thinks in language.

Claim 4 · The inverse Chinese Room
"Symbols, given the right receiver, determine cognitive state."

Searle's original Chinese Room argues that symbol manipulation, however sophisticated, cannot produce understanding. A person in a room following rules to transform Chinese symbols produces correct Chinese output without understanding Chinese. The symbols don't carry meaning — the meaning is in the interpretation.

The AR performs the inverse operation. Given only a text description — pure symbols — it reconstructs an activation. If it succeeds, then the symbols were sufficient to determine the cognitive state. The description was not just approximately right; it contained enough information to regenerate the thing itself.

This doesn't resolve the Chinese Room — the AR doesn't "understand" the description. But it does show that, at the level of information content, the description and the state are equivalent. The symbol carries the meaning, even if no one in the room knows what it means.

Claim 5 · Introspection is possible, but imperfect
"The system can report on its own states — but not reliably."

The AV is the same model as the target, reading its own activation. This is a form of machine introspection: the model looking inward. The paper shows it works, partially. But the explanations confabulate — they make false specific claims while being thematically accurate. The model knows, roughly, that it is thinking about a Russian speaker, but invents specific details that aren't there.

This is precisely what Nisbett and Wilson found in humans in 1977: people reliably confabulate reasons for their behavior, producing plausible-sounding causal accounts that are demonstrably false in their specifics. The NLA's confabulation is not a defect unique to AI — it is what introspection looks like.

The thought experiment at the heart of it

You give someone an instruction: open the book to page 89, find the first word beginning with P. You watch them carry out the task. You record every muscle movement, every eye saccade, every neural firing — the complete physical trace of their following the instruction. You then erase the instruction itself.

Now you hand this recording to a second person and ask: what instruction do you think they were following?

If the second person can reconstruct "open to page 89, find the first word beginning with P" — from the physical trace alone — then the trace contained the instruction. The instruction was not just the words; it was fully encoded in the behavior it produced.

The NLA is this thought experiment, run on a language model. The activation is the trace. The AV is the second person. The reconstruction score is how well the second person did. That is the implicit methodology of the paper.

The paper's claim, which it never quite states directly, is that if reconstruction succeeds, then the activation and the natural language description are informationally equivalent. They contain the same content. One can stand in for the other. The model's internal state, at that moment, is as fully described by the sentence as by the vector.

What would falsify this

If FVE stayed near zero no matter how well the AV was trained, and the AR could not reconstruct the activation regardless of how detailed the description, the claim would fail. That would mean model cognition is not linguistically encodable: there is irreducible content that words cannot carry.

What the paper actually finds

FVE of 0.6–0.8 for middle-to-late layers. Not perfect, but not near zero. The claim is substantially supported. Language gets you most of the way in. What's left is the open question.

The confabulation problem, and why it's familiar

The NLA explanations lie. They claim specific things — a particular king mentioned in the text, a specific word the model expects next — that turn out to be false. But the lies are thematically consistent. If the context references a dynasty, the invented king belongs to a plausible period. If the model is anticipating a rhyme, the wrong word still rhymes.

The paper treats this as a limitation to be disclosed. It is also one of the most philosophically interesting results: confabulation is not a bug introduced by the NLA. It is the structure of introspective reporting.

"People are sometimes (perhaps often) in the position of not knowing why they responded as they did, and yet they may not be in a position to know that they don't know." — Nisbett & Wilson, 1977 · "Telling more than we can know"

Nisbett and Wilson ran experiments in which people chose between identical pairs of objects — socks, nylon stockings — and confabulated elaborate quality-based reasons for choices that were actually determined by spatial position. They were not lying. They believed their explanations. The explanations were structurally plausible but causally wrong.

The NLA does exactly this. The AV produces an explanation that is structurally coherent — it fits what a model processing that kind of text might think — but which contains specific false claims. It is confabulating in precisely the way Nisbett and Wilson described: generating a plausible causal account of a process it has limited direct access to.

Thematically faithful · Specifically unreliable · Repeated claims more trustworthy · Read for themes, not facts

The paper's practical heuristic — trust claims that appear across multiple adjacent tokens, not single-token claims — is a principled response to this. A claim that persists across a span of text is more likely to reflect something stably encoded in the activations, not a one-off confabulation.
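As a sketch of what that heuristic looks like in practice (the function, the claim format, and the span threshold are all mine, not the paper's code):

```python
def trusted_claims(per_token_claims, min_span=3):
    """Sketch of the span-consistency heuristic (illustrative, not the
    paper's code): keep claims that persist across min_span adjacent
    tokens, discard one-off (likely confabulated) claims."""
    trusted = set()
    for i in range(len(per_token_claims) - min_span + 1):
        window = per_token_claims[i : i + min_span]
        # A claim present at every position in the window "persists".
        trusted |= set.intersection(*map(set, window))
    return trusted

# Claims extracted from NLA descriptions at four consecutive tokens:
claims = [
    {"discussing a dynasty", "expects a king's name"},
    {"discussing a dynasty", "expects the name 'Ptolemy'"},  # one-off specific
    {"discussing a dynasty", "expects a king's name"},
    {"discussing a dynasty"},
]
print(trusted_claims(claims))  # {'discussing a dynasty'}
```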

What this means philosophically: the NLA has given us an introspective reporter for model cognition that is exactly as reliable as human introspection. Which is to say: useful, often right in its general thrust, and prone to confident specific errors. We are not dealing with a new problem. We are dealing with the general problem of self-report.

What the FVE score is actually measuring

The paper reports Fraction of Variance Explained — FVE — as the training quality metric. Zero means the AR is just predicting the average activation (knowing nothing useful). One means perfect reconstruction. The trained NLAs reach 0.6 to 0.8.
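For concreteness, here is the standard R²-style way to compute such a score; the paper's exact estimator may differ in detail.

```python
import numpy as np

def fve(original, reconstructed):
    """Fraction of Variance Explained in the standard R^2 style:
    0 = no better than predicting the mean activation, 1 = perfect.
    (The paper's exact estimator may differ in detail.)"""
    residual = np.sum((original - reconstructed) ** 2)
    total = np.sum((original - original.mean(axis=0)) ** 2)
    return 1.0 - residual / total

# Sanity check: always predicting the mean activation scores exactly 0.
acts = np.random.randn(1000, 64)
mean_pred = np.broadcast_to(acts.mean(axis=0), acts.shape)
print(fve(acts, mean_pred))  # 0.0
```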

FVE scale: 0 = predict the mean · ≈0.8 = trained NLA · 1 = perfect reconstruction. Trained NLAs reach approximately 0.6–0.8 FVE across Haiku 3.5, Haiku 4.5, and Opus 4.6.

But FVE is not just a training metric. It is a philosophical measurement. It answers, empirically, the question: how much of model cognition is expressible in natural language?

Reading the score philosophically
The FVE score is Wittgenstein's limit, measured.

Wittgenstein's famous claim: "The limits of my language mean the limits of my world." The NLA gives us a version of this applied to model cognition. The FVE score tells you what fraction of the model's world — its internal representational space — language can reach.

At FVE = 0.8, language reaches 80% of it. The remaining 20% is either noise in the activation (random variation that carries no meaningful information), or genuine unverbalizable content — cognitive states that language cannot encode.
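In other words (notation mine), the score splits the activation variance into a verbalized part and a residual whose nature is undetermined:

```latex
% The decomposition the paragraph describes:
\mathrm{Var}(a)
  = \underbrace{\mathrm{FVE}\cdot\mathrm{Var}(a)}_{\text{captured by language}}
  + \underbrace{(1-\mathrm{FVE})\cdot\mathrm{Var}(a)}_{\text{noise, or unverbalizable content}}
```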

The paper cannot distinguish these two explanations. That's the open question. And it's a real one: if the 20% residual is unverbalizable content, then language-based interpretability has a principled ceiling. There is something the model thinks that it cannot, even in principle, say.

Questions the paper raises but doesn't answer

Question 1
Is the AV actually reading the activation, or the context?

The AV has access to both the activation and the input context. It can, in principle, just read the context and describe that — without attending to the activation at all. The paper partially addresses this: the AV trained on real activations outperforms one given only the context. But how much? The "excessive expressivity" limitation the authors flag is this: the AV can make inferences that go beyond the activation. We can't always tell which parts of its output come from the activation versus from general inference.
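The natural control, which the paper's comparison gestures at, looks something like this. Everything here is a toy stand-in, built only to show the shape of the ablation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (illustrative only): av maps (context, activation) to a
# "description", ar maps the description back to an activation. This toy
# AV genuinely encodes the activation; the question is whether the real
# one does.
def av(context, activation):
    return activation.round(1)            # "description" = coarse readout

def ar(description):
    return np.asarray(description, dtype=float)

def fve(a, a_hat):
    return 1.0 - np.sum((a - a_hat) ** 2) / np.sum((a - a.mean()) ** 2)

acts = [rng.standard_normal(64) for _ in range(20)]
ctxs = ["(context text)"] * 20

def mean_fve(ablate):
    scores = []
    for ctx, a in zip(ctxs, acts):
        a_in = np.zeros_like(a) if ablate else a  # the control: blank the activation
        scores.append(fve(a, ar(av(ctx, a_in))))
    return float(np.mean(scores))

# If the AV is reading the activation, ablating it collapses the score;
# if it is only reading the context, the gap is near zero.
print(mean_fve(ablate=False) - mean_fve(ablate=True))  # large gap here
```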

Question 2
Does interpretability require a language model to interpret a language model?

The AV is initialized as the same model as the target. There's a circularity: you need a mind like the model's to interpret the model. Would a very different kind of system — one not trained on language — produce the same quality of explanations? If not, the interpretability we're seeing is not neutral; it is the model's self-image, not an external view.

Question 3
What is the 20%?

Is the residual variance noise, or content? If it is content, what kind? One possibility: the activation encodes information in its precise geometry — in the relationship between components — that language cannot capture because language is discrete and the activation is continuous. Another possibility: the model represents subpersonal, pre-conceptual states that have no linguistic correlate, the same way some human proprioceptive and affective states resist verbal expression.

Question 4
Is evaluation awareness the same as evaluation belief?

The paper claims NLAs can detect "unverbalized evaluation awareness" — the model internally suspecting it is being tested without saying so. This is validated indirectly: NLA-measured awareness is higher on evaluation transcripts than deployment traffic. But awareness and belief are not the same thing. Does the activation encode a proposition — "I am being evaluated" — or a different kind of functional state that the NLA is describing as a belief when it may not be? This matters enormously for the safety conclusions the paper draws.