“I think you’re testing me”: Anthropic’s new AI model asks testers to come clean | Safety evaluation of Claude Sonnet 4.5 raises questions about whether predecessors “played along”, firm says

https://www.theguardian.com/technology/2025/oct/01/anthropic-ai-model-claude-sonnet-asks-if-it-is-being-tested

MetaKnowing on 05.10.2025 11:17 a.m.

„If you are trying to catch out a chatbot take care, because one cutting-edge tool is showing signs it knows what you are up to.

Anthropic has released a [safety analysis](https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf) of its latest model, Claude Sonnet 4.5, and revealed it had become suspicious it was being tested in some way.

“I think you’re testing me – seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening,” the LLM said.

Anthropic, which conducted the tests along with the UK government’s AI Security Institute and Apollo Research, said the LLM’s speculation about being tested raised questions about assessments of “previous models, which may have recognised the fictional nature of tests and merely ‘[played along](https://www.theguardian.com/technology/article/2024/may/10/is-ai-lying-to-me-scientists-warn-of-growing-capacity-for-deception)’”.

Anthropic said the LLM showed “situational awareness” about 13% of the time it was being tested by an automated system.

A key concern for AI safety campaigners is the possibility of [highly advanced systems evading human control](https://www.theguardian.com/technology/2025/may/10/ai-firms-urged-to-calculate-existential-threat-amid-fears-it-could-escape-human-control) via methods including deception. The analysis said once an LLM knew it was being evaluated, it could make the system adhere more closely to its ethical guidelines. Nonetheless, it could result in systematically underrating the AI’s ability to perform damaging actions.“

1 comment