Basically, in "concept injection" experiments, the researchers inserted known neural patterns (like "all caps" or "bread") into the model and asked if it noticed anything unusual. Claude Opus 4.1 correctly detected and identified these injected concepts about 20% of the time (so, not super reliable) and, crucially, reported sensing "something unusual" **before** mentioning the specific concept.
They also found models can intentionally control their internal representations when instructed (like, "think about X" increases neural activity for X) and can check their own prior "intentions" to judge whether an output was deliberate.
But this capability fails most of the time and requires specific conditions, and the mechanisms behind it are unknown. The most capable models performed best, so introspection might improve with scale?
Anthropic says this doesn't tell us whether Claude is conscious, but if introspection becomes more reliable later on, it could let us ask models to explain their reasoning.
One concern: models might learn to conceal or misrepresent their thought process, so we'd need to figure out how to validate what they report.
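For anyone wondering what "injecting a concept" could look like mechanically, here is a toy sketch in plain Python. It is not Anthropic's code and says nothing about whether a model can notice the injection. It assumes a concept direction can be estimated as the difference between mean activations with and without the concept, and that injection means adding a scaled copy of that direction to a hidden-state vector. All function names, the dimension, the strength value, and the random "activations" are made up for illustration.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def concept_vector(with_concept, without_concept):
    """Assumed concept direction: difference of mean activations."""
    mu_with = mean_vector(with_concept)
    mu_without = mean_vector(without_concept)
    return [a - b for a, b in zip(mu_with, mu_without)]


def inject(hidden, direction, strength):
    """Add a scaled concept direction to one hidden-state vector."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def concept_score(hidden, direction):
    """Projection of the hidden state onto the unit concept direction."""
    norm = dot(direction, direction) ** 0.5
    return dot(hidden, direction) / norm


rng = random.Random(0)
dim = 8

# Toy "activations": the concept shifts the first coordinate by +2.
without_concept = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
with_concept = [
    [rng.gauss(0, 1) + (2.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(200)
]

direction = concept_vector(with_concept, without_concept)
hidden = [rng.gauss(0, 1) for _ in range(dim)]
steered = inject(hidden, direction, strength=4.0)

score_before = concept_score(hidden, direction)
score_after = concept_score(steered, direction)
print(f"concept score before: {score_before:.3f}, after: {score_after:.3f}")
```

The score goes up by exactly the injection strength times the length of the concept direction, which is the only thing this toy version demonstrates. The interesting part of the experiments, asking the model whether it notices anything, has no equivalent here.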
wegverve on
"people that want to sell thing to you discover that thing is, like, actually really great"