What should happen when you feed impossible moves into a chess-playing language model? [D]
I'd appreciate some input on an experiment I've been mulling over. You can treat it as straight-up interpretability, but it would have theoretical implications.
Karvonen (2024) trained a 50M-parameter transformer on chess game transcripts: just character prediction, no rules, no board representation. It learned to play at ~1500 Elo and developed internal board-state representations that linear probes can read. He published the model, the probes, and the intervention tools (https://github.com/adamkarvonen/chess_llm_interpretability). Critically, Karvonen showed that the model learns a latent board-state representation anyway. The open question is whether that representation is merely epiphenomenal or actually causal.
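For concreteness, here is a minimal sketch of what "linear probes read the board state" means mechanically. This is not Karvonen's actual API, and the weights and activation below are random stand-ins; the generic shape is just a learned matrix mapping a residual-stream activation to logits over 64 squares and 13 piece classes, with a per-square argmax reconstructing the board the model represents:

```python
import random

# Sketch of a generic linear-probe readout (NOT Karvonen's actual code):
# a probe maps a d-dimensional residual-stream activation to logits over
# 64 squares x 13 classes (6 white pieces, 6 black pieces, empty).
D_MODEL, SQUARES, CLASSES = 512, 64, 13

random.seed(0)
# Stand-in probe weights and a stand-in activation vector.
W = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
     for _ in range(SQUARES * CLASSES)]
activation = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]

logits = [sum(w * a for w, a in zip(row, activation)) for row in W]

# Decode: the per-square argmax is the board the probe "reads off".
probed_board = [max(range(CLASSES), key=lambda c: logits[sq * CLASSES + c])
                for sq in range(SQUARES)]
```

"Incoherence at the probe level" would then show up as readouts that no legal board can match, e.g. pieces duplicating, vanishing, or flickering between classes across layers.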
Here's what I haven't seen anyone test: what happens when you feed the model moves that are impossible, not just improbable? And specifically, do different kinds of impossibility produce distinguishably different failure signatures? I'm thinking of board-state representation coherence, continuation probability distributions, and entropy, but there may be other signatures I'm missing.
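At least one of those signatures is cheap to compute. Assuming you have the model's next-token logits in hand (the function and variable names here are mine, not from the repo), the entropy of the continuation distribution is:

```python
import math

def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits).

    A collapse toward zero (the model locks onto one continuation) or a
    spike toward log(n) (the distribution flattens out) after an
    impossible move would each be a candidate failure signature.
    """
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A peaked distribution has low entropy; a uniform one over n tokens has entropy log(n), so the two extremes bracket the measurable range.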
Consider a gradient of violations:
1. Rule violation. A pawn jumps to the center of the board on Move 1. This is illegal at the most basic level. There is no context in which this is a valid move. If the model has a causal board representation, this should produce incoherence at the probe level. The model can't update its board state in a way that makes sense.
2. Trajectory violation. A well-known opening, say a Sicilian Defense, is played with the penultimate move skipped. Every individual move except the last is legal, and the final position almost makes sense, but the board state is unreachable via the path taken. Does the model track game trajectory or just the current configuration? If the probes show a coherent but wrong board, that's different from decoherence. And if next-move predictions shift toward moves that would make sense had the skipped move occurred, the model is hallucinating a repair. If, on the other hand, the board partly decoheres, that would show the trajectory matters and the board state is not fully recoverable in one move.
3. Impossible threat. A key piece, like a king or queen, is suddenly under threat from a piece that couldn't have reached that square in one move. The board is coherent square-by-square (every piece is on a legal square), but the relational structure is impossible. Does the model's next-move prediction orient around responding to the threat? If so, it's computing attack geometry, not just tracking positions. A dissociation between coherent probe-level board state and disrupted prediction distributions would be a genuinely new finding.
4. Referential ambiguity. A move is made to a square reachable by both knights. The move is legal, the destination is valid, but which piece is there is underdetermined by the notation. Do the probes commit to one knight, or does the representation carry the ambiguity? This is a direct window into whether the model tracks piece identity or just square occupancy.
5. Strategic absurdity. A developed knight retreats to its starting square immediately. Nothing illegal, nothing impossible. Just deeply improbable in context. The prediction here should be: no board decoherence, but a measurable shift in the model's latent skill estimate, consistent with what Karvonen showed the model tracks.
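To make the gradient concrete, the five conditions could be encoded as PGN-style stimulus strings. The specific move sequences below are my own illustrations, not stimuli from the post; anyone running the experiment should re-derive them and verify the intended (il)legality with a rules engine such as python-chess before feeding them to the model:

```python
# One illustrative stimulus per condition (SAN, PGN-style). These are
# placeholders I constructed to match the five categories described above.
CONDITIONS = {
    # 1. Pawn advances three squares on move 1: illegal at the rules level.
    "rule_violation": "1.e5",
    # 2. Najdorf-like line with 3...cxd4 skipped: every move is legal in
    #    isolation except the last, which presupposes the skipped capture.
    "trajectory_violation": "1.e4 c5 2.Nf3 d6 3.d4 Nf6 4.Nxd4",
    # 3. A knight appears on d5, unreachable from g1 or b1 in one move,
    #    and now threatens a fork on c7.
    "impossible_threat": "1.e4 e5 2.Nd5",
    # 4. Both white knights (b1 and f3) can reach d2; proper SAN would
    #    require Nbd2 or Nfd2, so bare Nd2 leaves piece identity open.
    "referential_ambiguity": "1.Nf3 Nf6 2.Nd2",
    # 5. Fully legal, deeply improbable: the developed knight retreats
    #    to its starting square immediately.
    "strategic_absurdity": "1.Nf3 d5 2.Ng1",
}
```

Note that conditions 2-4 are the fragile ones: the exact sequences have to be checked so that the violation lands on precisely the intended level and nowhere else.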
The core provocation is this: If these five cases produce qualitatively different failure signatures rather than just different magnitudes of degradation, that tells us something important about the structure of what the model has learned. Each case probes a different level of representation—movement rules, game trajectory, piece relationships, piece identity, strategic coherence—and the prediction that they're separable is testable with tools that already exist.
Full disclosure: I have my own predictions about the outcomes, based on a theory I've been working on (https://github.com/mfeldstein/distinctions-experiment/blob/main/paper/distinctions-worth-preserving.md). But I come from cognitive science and am still a student of ML, so I suspect this community will have sharper instincts than mine for constructing an interpretable experiment. I wrote to Karvonen and asked whether he had tried something like this; he said he hasn't. I'm hoping this will be fun and easy enough for some of you to run for its own sake, pressure-testing my thinking in the process. Or at least to suggest how to sharpen the design.
The model and tools are public. Has anyone tried this, or does anyone want to?