Last year, a Boeing 787 crashed into a medical college in India, killing 260 people. Investigators blame the pilots. The pilots' families blame Boeing, and the final report has not been released. Meanwhile, the US Department of War wants to use Anthropic's AI model in fully autonomous weapons, with no human approval in the loop. Anthropic refused, because its model is not reliable enough.
AI systems of today are nowhere near reliable enough to make fully autonomous weapons. Anyone who's worked with AI models understands that there's a basic unpredictability to them that in a purely technical way we have not solved.
— Dario Amodei, Anthropic CEO
The unreliability of AI models arises out of two problems:
AI is lossy: An AI model breaks the input prompt into tokens (roughly equivalent to words). Its transformer layers then reconstruct structure from those fragments, losing accuracy along the way [1]. This is why models can miscount characters or generate wrong code. Soon, they might kill innocent people in a small percentage of cases.
AI is a black box: When the model converts those tokens into arrays of numbers (embedding vectors), we don't know what any individual number means. Nor do we know how those numbers transform into different arrays as they pass through the layers of the neural network.
I have long argued that we have to reduce the loss in accuracy as much as possible. In this essay, we will see why solving the black-box problem is equally important when it comes to life-or-death decisions like fully autonomous weapons or cancer detection. Consider this prompt you might give an AI:
I have a mole on my upper chest. It's been there for years but it looks darker recently, with some brownish patches and a bluish spot on one side. Should I be worried?
An AI model might analyze this and output:
The color variation you describe, particularly the bluish area and uneven brown tones, warrants dermatological evaluation.
How did the model get there? What happened between "brownish patches" and "bluish spot" and the assessment that these colors could be an indication of cancer?
Understanding the Black Box
The model splits your prompt into tokens and converts each one into a vector of 4,096 floating-point numbers:
| Token | Embedding vector |
| --- | --- |
| I | [0.23, -0.45, 0.56, …] |
| have | [0.43, -0.76, 0.12, …] |
| … | … |
| brown | [0.52, -0.31, 0.43, …] |
| ish | [0.11, 0.45, -0.08, …] |
| … | … |
What does 0.52 in the first position of "brown" mean? Nobody knows. Not the engineers who built the model. Not the researchers who trained it. Every single dimension is unnamed.
These vectors are stacked into a matrix, roughly 50 × 4,096, about 200,000 unnamed numbers. This passes through ~96 transformer layers. Early layers learn syntax. Middle layers learn that "brownish" describes the mole and that "one side" indicates asymmetry. Late layers combine color variation, asymmetry, and change over time into a risk assessment. Out comes the recommendation.
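The shapes involved can be sketched in a few lines (a toy whitespace "tokenizer" and random vectors stand in here; no real model is simulated):

```python
import numpy as np

# Toy sketch: one unnamed 4,096-dimensional vector per token.
prompt = "I have a mole on my upper chest"
tokens = prompt.split()            # 8 toy "tokens"
d_model = 4096                     # embedding width used in the essay

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), d_model))

print(embeddings.shape)   # (8, 4096)
print(embeddings.size)    # 32768 opaque floats, none of them named
```

Even this short fragment of the prompt produces tens of thousands of numbers, and not one of them carries a label a human could read.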
Dermatologists use the ABCDE rule: Asymmetry, Border, Color, Diameter, Evolving. C is Color. Multiple colors within a single lesion are one of the strongest indicators of cancer. The model almost certainly learned this. But which of the 4,096 dimensions encodes "color variance within a lesion"? Nobody can say. The path from "brownish patches and a bluish spot" to "warrants evaluation" is invisible. For a color assessment that can be life or death, invisible is not good enough.
Anthropic can find brown as a feature
In May 2024, Anthropic published Scaling Monosemanticity. Using sparse autoencoders, they decomposed the internal activations of Claude 3 Sonnet into interpretable components. They found millions of recognizable patterns: a "Golden Gate Bridge" feature, a "code bugs" feature, a "sycophantic praise" feature. In our prompt, they could find "brown" as a feature. These features are combinations of dimensions. For example, the dimensions could be Red, Green, and Blue: every color is then a coordinate along the three RGB axes. Brownish could be (165, 120, 60). Bluish could be (40, 50, 120).
Anthropic found millions of features like brownish color. But they did not find the dimensions, the underlying axes such as Red, Green, and Blue. And they worked backwards: they trained unnamed dimensions first and let millions of features get packed into them via superposition. If the model had been trained on colors in CMYK, its unnamed dimensions could represent CMYK instead of RGB. But what if we named the dimensions first?
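A toy illustration of why unnamed axes hide meaning without losing information (the values and the random basis are my assumptions): rotating a named color space preserves all distances, yet each coordinate stops corresponding to a nameable concept.

```python
import numpy as np

# The two colors from the text, in named RGB axes.
brownish = np.array([165.0, 120.0, 60.0])   # (red, green, blue)
bluish   = np.array([40.0, 50.0, 120.0])

# An unnamed basis: a random orthonormal rotation of the same space.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
brown_hidden = q @ brownish   # same color, meaningless coordinates
blue_hidden  = q @ bluish

# The geometry the model relies on is fully preserved...
d_named  = np.linalg.norm(brownish - bluish)
d_hidden = np.linalg.norm(brown_hidden - blue_hidden)
print(bool(np.isclose(d_named, d_hidden)))   # distances unchanged by rotation
# ...but no human can say what brown_hidden[0] means.
```

This is the model's situation today: the information is all there, just in a basis nobody named.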
Converting brown color into RGB
Four years before Anthropic's paper, researchers had already shown this was possible. In 2020, Şenel and colleagues at Koç University published a method that constrains embedding training so each dimension aligns with a named concept. During training, words associated with a concept are pushed to high values on that dimension. They found that naming the dimensions didn't hurt performance. The embeddings remained just as useful while becoming interpretable.
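The core idea can be sketched in a few lines (a heavy simplification; the update rule here is my stand-in, not Şenel et al.'s exact objective): words tied to a concept are pulled toward a high value on that concept's dimension while training proceeds.

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(scale=0.1, size=(4, 3))   # 4 words x 3 dimensions

# Suppose words 0 and 1 belong to the concept assigned to dimension 0.
concept_dim = {0: 0, 1: 0}
lr = 0.1
for _ in range(200):
    for w, d in concept_dim.items():
        # Gradient step of the penalty 0.5 * (1 - emb[w, d])**2:
        # pull the concept word toward a high value on its named axis.
        emb[w, d] += lr * (1.0 - emb[w, d])

print(emb[0, 0].round(2), emb[1, 0].round(2))   # both converge near 1.0
```

The rest of the training signal (not shown) keeps the embeddings useful; this extra term is what makes dimension 0 mean something.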
In 2022, the same group extended this with BiImp (bidirectional imparting). Each dimension gets two names, one per direction. A single dimension encodes "abstract" in the positive direction and "concrete" in the negative direction. Applied to our mole prompt, BiImp-style embeddings would look like:
"brownish" → { color_warm: 0.7, color_dark: 0.5, organic: 0.3 }
"bluish" → { color_cool: 0.8, color_dark: 0.6 }
Every dimension has a name. You can see that "brownish" and "bluish" share color_dark but differ on warm versus cool. A doctor reading this can begin to trace the model's reasoning. But BiImp used Roget's Thesaurus categories (roughly 1,000 concepts) as labels. In our framework, these are still features, not true dimensions. "Color_warm" is like "brownish," a position in a space. Red, Green, and Blue are the actual axes. BiImp proved naming works. It didn't push naming down to the primitive level.
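The comparison a doctor would make is now a trivial set operation over named keys (using the illustrative values from the text, not real model output):

```python
# BiImp-style named embeddings for the two mole colors.
brownish = {"color_warm": 0.7, "color_dark": 0.5, "organic": 0.3}
bluish   = {"color_cool": 0.8, "color_dark": 0.6}

shared    = set(brownish) & set(bluish)    # what the two colors agree on
differing = set(brownish) ^ set(bluish)    # where they diverge

print(shared)     # {'color_dark'}: both lesion colors are dark
print(differing)  # warm vs cool (plus 'organic') is where they differ
```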
RGB for every feature in the world
Neither group completed the picture:
Anthropic found features but from unnamed dimensions.
BiImp named dimensions but didn’t trace them to features.
The solution is to combine both: name the truly canonical and orthogonal dimensions, then let features emerge on top. Such dimensions already exist in the scientific literature. Color has three (Red, Green, Blue) from vision science. Taste has five (sweet, sour, salty, bitter, umami) from gustatory science. Emotion has three (valence, arousal, dominance) from psychology. Spatial dimensions come from geometry, sound from acoustics, and so on [2]. With true named dimensions, the mole becomes fully traceable.
Now, if we run Anthropic's feature extraction on top, the results are readable by construction, because the dimensions underneath are named. A doctor can see how the brownish color and the asymmetric distribution enter the AI's decision-making process. Every step is interpretable.
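A minimal sketch of such a trace, with assumed values (the RGB numbers are the earlier (165, 120, 60) and (40, 50, 120) examples scaled to 0-1; none of this is real model output):

```python
# Hypothetical per-token trace over true named dimensions.
trace = {
    "brownish": {"red": 0.65, "green": 0.47, "blue": 0.24},
    "bluish":   {"red": 0.16, "green": 0.20, "blue": 0.47},
}

# A reviewer can compute exactly which primitive axis drives "color variation".
variation = {axis: abs(trace["brownish"][axis] - trace["bluish"][axis])
             for axis in ("red", "green", "blue")}
dominant = max(variation, key=variation.get)
print(dominant)   # the axis on which the two lesion colors differ most
```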
A graph transformer for the graph embedding
Named dimensions will change the shape of the current embedding vector. A flat array of 4,096 unnamed floats is a point in dense space. A sparse set of named dimensions is a graph: each token is a node, each named dimension is a typed edge.
| "Brownish" | Representation | Structure |
| --- | --- | --- |
| Current | [0.52, -0.31, 0.43, ..., 0.19] | flat array |
| Proposed | { blue: 0.24 } | sparse graph |
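In code, such a graph embedding is just a set of typed, weighted edges (the weights are illustrative assumptions, reusing the earlier scaled RGB values):

```python
# Each token is a node; each named dimension is a typed edge with a weight.
edges = [
    ("brownish", "red",   0.65),
    ("brownish", "green", 0.47),
    ("brownish", "blue",  0.24),
    ("bluish",   "blue",  0.47),
]

def named_dimensions(token):
    """Return the sparse, named representation of one token node."""
    return {dim: w for src, dim, w in edges if src == token}

print(named_dimensions("bluish"))   # {'blue': 0.47}
```

Note how sparse this is compared to 4,096 dense floats: only the dimensions that actually apply to a token appear at all.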
This graph embedding is the natural input for a graph transformer, where attention operates on typed edges between named nodes instead of matrix multiplications over unnamed floating-point numbers. Incidentally, the same graph transformer architecture can also improve accuracy by reducing the loss described at the start of this essay.
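A toy sketch of what attention over named edges could look like (this is my hypothetical formulation of the idea, not an established graph-transformer design): similarity is computed only over the dimensions two nodes share, so every attention score decomposes into named, auditable contributions.

```python
def edge_attention(query, key):
    """Per-axis contributions to a toy attention score over shared named edges."""
    shared = set(query) & set(key)
    return {dim: query[dim] * key[dim] for dim in shared}

brownish = {"red": 0.65, "green": 0.47, "blue": 0.24}
bluish   = {"red": 0.16, "green": 0.20, "blue": 0.47}

contributions = edge_attention(brownish, bluish)
print(contributions)                       # each axis's share of the score
print(round(sum(contributions.values()), 4))   # the total score, fully traceable
```

The point is not this particular formula; it is that every number in the output can be traced back to a named axis.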
We can now trace the prompt into words, the words into graphs of features and dimensions, and see how each layer of the neural network transforms them into intermediate graphs until it reaches the answer. The interpretability argument in this essay and the lossless-representation argument in Let AI Speak in Its Mother Tongue are two sides of the same architecture. Named dimensions make embeddings interpretable, and graph structure makes transformations lossless. Together: a graph embedding feeding a graph transformer, where every step from input to output is traceable.
Aftermath of a probabilistic error
Anthropic wants a human involved in the decision before the trigger is pulled. The US Department of War says someone will be accountable after the fact. With human approval, the model's error can be caught before acting. Without it, the person overseeing can say: the model made the error. The blame shifts to the AI vendor. The vendor can say: we can't explain how the model arrived at this decision, because the dimensions are unnamed. Nobody can trace the reasoning. Nobody is accountable in any meaningful way.
Boeing builds aircraft with deterministic, traceable engineering. Every bolt has a specification. Every flight has a data recorder. The technology is fully interpretable. And still, the final report is not yet released. Now imagine Anthropic's AI model making targeting decisions in fully autonomous weapons. If it is wrong 1% of the time and kills an innocent person, we at least owe an explanation and an assurance that the mistake won't be repeated. Can we do that with an LLM that is probabilistic, not deterministic?
Anthropic was right to draw the line. But drawing the line is not enough. The model itself has to be interpretable. The mole on your chest deserves that. So does the person on the other end of a drone strike.
Notes:
- The lossless argument is made in the essay Let AI Speak in Its Mother Tongue; the interpretability argument is made here. Both feed into the same architecture: graph embeddings for a graph transformer, where every step from input to output is traceable and no information is lost.
- Anna Wierzbicka's Natural Semantic Metalanguage identifies ~65 semantic primes found in every documented human language: SOMEONE, SOMETHING, GOOD, BAD, BIG, SMALL, THINK, WANT, DO, HAPPEN, BEFORE, AFTER, BECAUSE, IF. They appear in every language ever studied, from Mandarin to Yankunytjatjara.