
Scientists have developed an exam so comprehensive, so demanding, and so deeply rooted in expert human knowledge that current AI systems consistently fail it. "Humanity's Last Exam" presents 2,500 questions spanning mathematics, the humanities, natural sciences, ancient languages, and highly specialized subfields.
https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/
7 comments
When artificial intelligence systems began acing long‑standing academic assessments, researchers realized they had a problem: the tests were too easy.
Popular evaluations, such as the Massive Multitask Language Understanding (MMLU) exam, once considered formidable, are no longer challenging enough to meaningfully test advanced AI systems.
**To address this gap, a global consortium of nearly 1,000 researchers, including a Texas A&M University professor, created something different — an exam so broad, so challenging and so deeply rooted in expert human knowledge that current AI systems consistently fail it.**
**“Humanity’s Last Exam” (HLE) introduces a 2,500‑question assessment spanning mathematics, humanities, natural sciences, ancient languages and highly specialized subfields**. The team’s work is outlined in a paper published in Nature with documentation from the project available at lastexam.ai.
Early results showed that even the most advanced models struggled. GPT‑4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI’s flagship o1 model achieved only 8%. The most advanced models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached around 40% to 50% accuracy.
For those interested, here’s the link to the peer-reviewed journal article:
https://www.nature.com/articles/s41586-025-09962-4
> Early results showed that even the most advanced models struggled. GPT‑4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI’s flagship o1 model achieved only 8%. The most advanced models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached around 40% to 50% accuracy.
That’s pretty good
Well, you can already see that the advanced AI versions have made huge gains. It's only a matter of time before they ace the test.
This is pretty old news; recent models are already getting around 40-50% on this. This benchmark will likely be saturated this year.
The benchmark has been in use for almost a year now and current-gen models are already getting >40% on it, see e.g. [https://deepmind.google/models/model-cards/gemini-3-1-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/) with 44.4%. Take that as you will.
I understand that publishing journal papers is a fairly lengthy process, but the article would’ve made much more sense a year ago.
From the paper:
> Before submission, each question is tested against state-of-the-art LLMs to verify its difficulty—questions are rejected if LLMs can answer them correctly.
This seems like a bit of a circular approach. The only questions on the test are ones that have already been tested against LLMs and that those LLMs failed to answer correctly. It's certainly interesting, as it shows where the limits of the current crop of LLMs lie, but even the paper notes this is unlikely to last: previous LLMs have gone from near-zero to near-perfect scores on tests like this in a relatively short timeframe.
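To make the circularity concrete, the rejection step the paper describes boils down to something like this. This is a minimal Python sketch under my own assumptions: the model list, the `ask_llm` helper, and the exact-match check are placeholders, not anything from the paper or from lastexam.ai.

```python
# Sketch of the adversarial screening step described in the HLE paper:
# a candidate question only enters the benchmark if none of the frontier
# models used for screening answers it correctly.
# SCREENING_MODELS and ask_llm() are illustrative placeholders.

SCREENING_MODELS = ["gpt-4o", "claude-3.5-sonnet", "o1"]

def ask_llm(model: str, question: str) -> str:
    """Stand-in for a real API call that returns the model's answer."""
    return "dummy answer"  # replace with an actual call to the model

def survives_screening(question: str, correct_answer: str) -> bool:
    """Reject the question as soon as any screening model gets it right."""
    for model in SCREENING_MODELS:
        answer = ask_llm(model, question)
        # Exact string match is a simplification; real grading is looser.
        if answer.strip().lower() == correct_answer.strip().lower():
            return False
    return True

candidate = {"question": "What is 2 + 2?", "answer": "4"}
if survives_screening(candidate["question"], candidate["answer"]):
    print("accepted: no screening model answered it correctly")
else:
    print("rejected: at least one screening model already answers it")
```

Which is also why the screening-era models' reported scores sit near zero more or less by construction, while later models are free to climb.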
“We call it Voight Kampff for short”