Science

Study shows that nearly two-thirds of AI-generated citations are fabricated or contain errors. The unreliability of large language models such as OpenAI's GPT-4o highlights a significant risk for scientific research.

21.11.2025

4 Comments

mvea on 21.11.2025 12:22 a.m.

I’ve linked to the news release in the post above. In this comment, for those interested, here’s the link to the peer-reviewed journal article:

https://mental.jmir.org/2025/1/e80371

From the linked article:

**Study finds nearly two-thirds of AI-generated citations are fabricated or contain errors**

A new investigation into the **reliability of advanced artificial intelligence models highlights a significant risk for scientific research**. The study, published in JMIR Mental Health, found that **large language models like OpenAI’s GPT-4o** frequently generate fabricated or inaccurate bibliographic citations, with these errors becoming more common when the AI is prompted on less familiar or highly specialized topics.

One of the known limitations of these models is a tendency to produce “hallucinations,” which are confident-sounding statements that are factually incorrect or entirely made up. In academic writing, a particularly problematic form of this is the fabrication of scientific citations, which are the bedrock of scholarly communication.

The analysis showed that across all six reviews, nearly one-fifth of the citations, 35 out of 176, were entirely fabricated. Of the 141 citations that corresponded to real publications, almost half contained at least one error, such as an incorrect digital object identifier, which is a unique code used to locate a specific article online. In total, nearly two-thirds of the references generated by the model were either invented or contained bibliographic mistakes.
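As a quick sanity check on the figures quoted above, here is a minimal sketch that recomputes the fabricated share from the two counts given in the text (35 fabricated out of 176 total). The variable names and the `share_percent` helper are illustrative and are not taken from the study. The excerpt does not give an exact count of real citations with errors, so that figure is left out.

```python
# Counts quoted in the excerpt above.
total_citations = 176
fabricated_citations = 35

# Citations that corresponded to real publications.
real_citations = total_citations - fabricated_citations


def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(real_citations)                                        # 141
print(share_percent(fabricated_citations, total_citations))  # 19.9
```

Both results match the text: 141 real citations, and a fabricated share of 19.9 percent, which is the "nearly one-fifth" in the excerpt.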

The rate of citation fabrication was strongly linked to the topic. For major depressive disorder, the most well-researched condition, only 6 percent of citations were fabricated. In contrast, the fabrication rate rose sharply to 28 percent for binge eating disorder and 29 percent for body dysmorphic disorder. This suggests the AI is less reliable when generating references for subjects that are less prominent in its training data.

TERRADUDE on 21.11.2025 12:41 a.m.

I’ve recently used ChatGPT for some research projects, asking for references along the way. When I’ve checked, about half are either wrong or completely made up. I can deal with the wrong references, but the made-up references are very problematic.

hoyfish on 21.11.2025 12:47 a.m.

Now try it again with the latest models.

…and see the same damn issue.

jem0208 on 21.11.2025 12:47 a.m.

My experience with LLMs and citations is that they’re utterly useless when generated directly from the LLM – which is not at all surprising.

However, with online search enabled they’re really very good as an initial research tool. I wouldn’t use them for actually writing anything, but for finding sources related to a topic they can be very helpful.