
Scientists have developed an exam so comprehensive, so demanding, and so deeply rooted in expert human knowledge that current AI systems consistently fail it. "Humanity's Last Exam" presents 2,500 questions spanning mathematics, the humanities, natural sciences, ancient languages, and highly specialized subfields.
https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/
7 comments
When artificial intelligence systems began acing long‑standing academic assessments, researchers realized they had a problem: the tests were too easy.
Popular evaluations, such as the Massive Multitask Language Understanding (MMLU) exam, once considered formidable, are no longer challenging enough to meaningfully test advanced AI systems.
**To address this gap, a global consortium of nearly 1,000 researchers, including a Texas A&M University professor, created something different — an exam so broad, so challenging and so deeply rooted in expert human knowledge that current AI systems consistently fail it.**
**“Humanity’s Last Exam” (HLE) introduces a 2,500‑question assessment spanning mathematics, humanities, natural sciences, ancient languages and highly specialized subfields**. The team’s work is outlined in a paper published in Nature with documentation from the project available at lastexam.ai.
Early results showed that even the most advanced models struggled. GPT‑4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI’s flagship o1 model achieved only 8%. The most advanced models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached around 40% to 50% accuracy.
For those interested, here’s the link to the peer-reviewed journal article:
https://www.nature.com/articles/s41586-025-09962-4
> Early results showed that even the most advanced models struggled. GPT‑4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI’s flagship o1 model achieved only 8%. The most advanced models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached around 40% to 50% accuracy.
That’s pretty good
Well, you can already see that the advanced AI versions have made huge gains. It's only a matter of time before they ace the test.
This is pretty old news; recent models are already getting around 40-50% on this. This benchmark will likely be saturated this year.
The benchmark has been in use for almost a year now and current-gen models are already getting >40% on it, see e.g. [https://deepmind.google/models/model-cards/gemini-3-1-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/) with 44.4%. Take that as you will.
I understand that publishing journal papers is a fairly lengthy process, but the article would’ve made much more sense a year ago.
From the paper:
> Before submission, each question is tested against state-of-the-art LLMs to verify its difficulty—questions are rejected if LLMs can answer them correctly.
This seems like a bit of a circular approach. The only questions on the test are ones that have already been tested against LLMs and that those LLMs failed to answer correctly. It's certainly interesting, as it shows where the limits of the current crop of LLMs lie, but even the paper notes this is unlikely to last: previous LLMs have gone from near-zero to near-perfect scores on tests like this in a relatively short timeframe.
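To make the circularity concrete, the rejection step the paper describes boils down to something like this. This is a minimal Python sketch under my own assumptions: the model list, the `ask_llm` helper, and the exact-match check are placeholders, not anything from the paper or from lastexam.ai.

```python
# Sketch of the adversarial screening step described in the HLE paper:
# a candidate question only enters the benchmark if none of the frontier
# models used for screening answers it correctly.
# SCREENING_MODELS and ask_llm() are illustrative placeholders.

SCREENING_MODELS = ["gpt-4o", "claude-3.5-sonnet", "o1"]

def ask_llm(model: str, question: str) -> str:
    """Stand-in for a real API call that returns the model's answer."""
    return "dummy answer"  # replace with an actual call to the model

def survives_screening(question: str, correct_answer: str) -> bool:
    """Reject the question as soon as any screening model gets it right."""
    for model in SCREENING_MODELS:
        answer = ask_llm(model, question)
        # Exact string match is a simplification; real grading is looser.
        if answer.strip().lower() == correct_answer.strip().lower():
            return False
    return True

candidate = {"question": "What is 2 + 2?", "answer": "4"}
if survives_screening(candidate["question"], candidate["answer"]):
    print("accepted: no screening model answered it correctly")
else:
    print("rejected: at least one screening model already answers it")
```

Which is also why the screening-era models' reported scores sit near zero more or less by construction, while later models are free to climb.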
“We call it Voight Kampff for short”