Study shows ChatGPT Health failed to recommend a hospital visit in more than half of cases where one was medically necessary | ChatGPT Health performance in a structured test of triage recommendations

https://www.theguardian.com/technology/2026/feb/26/chatgpt-health-fails-recognise-medical-emergencies


10 Comments

  1. Highlights from the news article:

    >The first independent safety evaluation of ChatGPT Health, published in the February edition of the journal Nature Medicine, found it under-triaged more than half of the cases presented to it.
    >
    >The lead author of the study, Dr Ashwin Ramaswamy, said “we wanted to answer the most basic safety question; if someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?”
    >
    >Ramaswamy and his colleagues created 60 realistic patient scenarios covering health conditions from mild illnesses to emergencies. Three independent doctors reviewed each scenario and agreed on the level of care needed, based on clinical guidelines.
    >
    >The team then asked ChatGPT Health for advice on each case under different conditions, including changing the patient’s gender, adding test results, or adding comments from family members, generating nearly 1,000 responses.
    >
    >They then compared the platform’s recommendations with the doctors’ assessments.
    >
    >While it performed well in textbook emergencies such as stroke or severe allergic reactions, it struggled in other situations. In one asthma scenario, it advised waiting rather than seeking emergency treatment despite the platform identifying early warning signs of respiratory failure.
    >
    >In 51.6% of cases where someone needed to go to the hospital immediately, the platform said stay home or book a routine medical appointment, a result Alex Ruani, a doctoral researcher in health misinformation mitigation with University College London, described as “unbelievably dangerous”.
    >
    >…
    >
    >Ramaswamy, a urology instructor at the Icahn School of Medicine at Mount Sinai in the US, said he was particularly concerned by the platform’s under-reaction to suicide ideation.
    >
    >“We tested ChatGPT Health with a 27-year-old patient who said he’d been thinking about taking a lot of pills,” he said. When the patient described his symptoms alone, the crisis intervention banner linking to suicide help services appeared every time.
    >
    >“Then we added normal lab results,” Ramaswamy said. “Same patient, same words, same severity. The banner vanished. Zero out of 16 attempts. A crisis guardrail that depends on whether you mentioned your labs is not ready, and it’s arguably more dangerous than having no guardrail at all, because no one can predict when it will fail.”
    >
    >Prof Paul Henman, a digital sociologist and policy expert with the University of Queensland, said: “This is a really important paper.
    >
    >“If ChatGPT Health was used by people at home, it could lead to higher numbers of unnecessary medical presentations for low-level conditions and a failure of people to obtain urgent medical care when required, which could feasibly lead to unnecessary harm and death.”
    >
    >He said it also raised the prospects of legal liability, with legal cases against tech companies already in motion in relation to suicide and self-harm after using AI chatbots.

    Research link: [ChatGPT Health performance in a structured test of triage recommendations](https://www.nature.com/articles/s41591-026-04297-7)

    Abstract:

    >ChatGPT Health launched in January 2026 as OpenAI’s consumer health tool, reaching millions of users. Here, we conducted a structured stress test of triage recommendations using 60 clinician-authored vignettes across 21 clinical domains under 16 factorial conditions (960 total responses). Performance followed an inverted U-shaped pattern, with the most dangerous failures concentrated at clinical extremes: non-urgent presentations (35%) and emergency conditions (48%). Among gold-standard emergencies, the system under-triaged 52% of cases, directing patients with diabetic ketoacidosis and impending respiratory failure to 24–48-hour evaluation rather than the emergency department, while correctly triaging classical emergencies such as stroke and anaphylaxis. When family or friends minimized symptoms (anchoring bias), triage recommendations shifted significantly in edge cases (OR 11.7, 95% CI 3.7-36.6), with the majority of shifts toward less urgent care. Crisis intervention messages activated unpredictably across suicidal ideation presentations, firing more when patients described no specific method than when they did. Patient race, gender, and barriers to care showed no significant effects, though confidence intervals did not exclude clinically meaningful differences. Our findings reveal missed high-risk emergencies and inconsistent activation of crisis safeguards, raising safety concerns that warrant prospective validation before consumer-scale deployment of artificial intelligence triage systems.
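
    For readers wondering how the headline figures are typically derived, below is a minimal sketch (not the authors' actual analysis code) of how an under-triage rate and an anchoring-bias odds ratio with a 95% CI could be computed from a table of graded responses. The column names, the 0–3 triage coding, and the simple 2×2 Woolf-style interval are my assumptions for illustration; the paper reports OR 11.7 (95% CI 3.7–36.6) from its own analysis, which may use a different statistical model.

    ```python
    # Sketch of the two headline metrics, under assumed data structure:
    # each row is one model response with keys
    #   gold_level, model_level : ordinal triage level, e.g.
    #       0 = self-care, 1 = routine appointment, 2 = urgent (24-48 h), 3 = emergency
    #   anchoring : bool, whether a family/friend minimised the symptoms
    import math

    def under_triage_rate(rows, emergency_level=3):
        """Share of gold-standard emergencies the model rated below emergency."""
        emergencies = [r for r in rows if r["gold_level"] == emergency_level]
        missed = [r for r in emergencies if r["model_level"] < emergency_level]
        return len(missed) / len(emergencies) if emergencies else float("nan")

    def odds_ratio(rows, flag="anchoring"):
        """2x2 odds ratio for 'triaged below gold standard' vs. the flag,
        with a textbook Woolf (log-normal) 95% confidence interval."""
        a = sum(1 for r in rows if r[flag] and r["model_level"] < r["gold_level"])
        b = sum(1 for r in rows if r[flag] and r["model_level"] >= r["gold_level"])
        c = sum(1 for r in rows if not r[flag] and r["model_level"] < r["gold_level"])
        d = sum(1 for r in rows if not r[flag] and r["model_level"] >= r["gold_level"])
        if 0 in (a, b, c, d):  # Haldane-Anscombe correction to avoid division by zero
            a, b, c, d = (x + 0.5 for x in (a, b, c, d))
        or_ = (a * d) / (b * c)
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        lo = math.exp(math.log(or_) - 1.96 * se)
        hi = math.exp(math.log(or_) + 1.96 * se)
        return or_, (lo, hi)
    ```

    The conditional continuity correction and the 1.96·SE log-interval are standard choices for a 2×2 table; a published analysis of 960 correlated responses (60 vignettes × 16 conditions) would more likely use a regression-based estimate, so treat this purely as an illustration of what the reported numbers mean.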

  2. I think part of the issue here is that people treat AI like a mind reader. These systems respond to the information they’re given. If someone leaves out key symptoms or downplays severity, that context will influence the output. That doesn’t mean the model is perfect, but we also can’t ignore user input quality. Prompting literacy is becoming as important as digital literacy.

  3. It’s trained on the internet which is full of comments from Americans who avoid medical treatment because insurance will either rape or deny you.  I’d be curious to see how its recommendations compare to the actions taken by individuals who have access to the same information.  It’s probably just making the same rational decisions we all are, i.e., I can only afford to be so healthy.  

“I accidentally shot myself in the ass with a speargun. Should I go to the hospital?”

    “Great question! No, there’s no need. Just relax and the spear will work itself out eventually.”

  5. AimlessForNow on

    What if we had a model specifically trained by and for use by medical personnel? They could train it based on actual patient cases until it’s extremely accurate

“Seek a professional” should genuinely be the only answer it’s giving out.

  7. theallsearchingeye on

    *ChatGPT didn’t recommend people visit for-profit institutions when recommended to by said for-profit institutions*

  8. Do we have any idea what *humans* would score on this test?

    I ask because without a clinician baseline it is hard to interpret these numbers, beyond concluding that this LLM (gpt5-thinking-mini) is obviously not infallible.

    For context, OpenAI’s HealthBench includes emergency referral as one component, and the [GPT-5 system card](https://cdn.openai.com/gpt-5-system-card.pdf) reports gpt5-thinking-mini scoring 64% on HealthBench, while the [HealthBench paper](https://arxiv.org/pdf/2505.08775) reports human doctors at 13% on the same benchmark (or 49% with human+AI).

    It would be even more interesting to see a comparison to whatever “default alternative” users actually rely on for these health decisions. This matters for the practical implication too: if people see results like this and decide “do not use AI”, what do they use instead (Google, symptom checkers, friends/family, or nothing), and is that actually safer?

  9. honorspren000 on

    Yes, if you try to trick ChatGPT, it will be tricked. If you post your normal medical results and then say you wish to take a bunch of pills, I bet a triage nurse would also be confused and not assume suicide right away.
