Activation probe

A probe is a linear classifier trained on the model's residual stream, the running sum of each layer's output that flows between transformer layers, read here at a single token position. Green dots are labeled traces where the model was genuinely confident; red dots are traces where it was confused and wrong. Gold dots mark the frontier: correct answers from an internally uncertain model, a signal that output evaluation alone cannot capture.
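
As a rough illustration of the idea, the sketch below fits a linear probe as plain logistic regression in NumPy and then sorts traces into the three colors. Everything in it is an assumption made for illustration: the activations are synthetic stand-ins for residual-stream vectors, and the names `train_linear_probe`, `probe_confidence`, `categorize`, and the 0.5 threshold are hypothetical. It is not the method used to produce the figure.

```python
import numpy as np

def sigmoid(z):
    # Clip to keep np.exp from overflowing on large logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit logistic-regression weights (w, b) by batch gradient descent.

    acts:   (n, d) array, one activation vector per trace (synthetic here).
    labels: (n,) array of 0/1, where 1 means "labeled as confident".
    """
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        grad = sigmoid(acts @ w + b) - labels  # d(loss)/d(logit)
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_confidence(acts, w, b):
    """Probe's estimated probability that each trace is 'confident'."""
    return sigmoid(acts @ w + b)

def categorize(conf, answer_correct, threshold=0.5):
    """Map (probe confidence, output correctness) to a dot color.

    The 'other' bucket (confident but wrong) is not described in the
    text above; it is included only so every case has a label.
    """
    colors = []
    for c, ok in zip(conf, answer_correct):
        confident = c >= threshold
        if confident and ok:
            colors.append("green")
        elif not confident and not ok:
            colors.append("red")
        elif not confident and ok:
            colors.append("gold")
        else:
            colors.append("other")
    return colors

# Synthetic stand-in for residual-stream activations at one token position:
# the two classes are shifted apart along a single random direction.
rng = np.random.default_rng(0)
n, d = 200, 16
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n).astype(float)
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction)

w, b = train_linear_probe(acts, labels)
acc = ((probe_confidence(acts, w, b) > 0.5) == (labels == 1)).mean()
```

The point of `categorize` is that it takes two independent inputs: the probe's reading of the internal state and the correctness of the output. The gold bucket exists only because those two can disagree, which is why grading outputs alone cannot recover it.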