Column
Same Symptoms, Different Urgency: The Gender Bias in LLM Medical Triage You Need to Know About
A study testing Gemini, Claude, and GPT with identical neurological symptom profiles — changing only gender and age — found dramatically lower emergency referral rates for young women. The mechanism isn't random error. It's epidemiologically-driven diagnostic substitution. And it's consistent across all three model families.
Hi, I’m Keito Inoshita from Affectosphere Group.
A common assumption in AI adoption: the model won’t have the implicit biases a human would.
Human clinicians carry unconscious associations. They make different decisions for different patients in ways that aren’t always clinically justified. An AI system, reasoning from inputs rather than from internalized social patterns, should be more consistent.
A paper published on arXiv in June 2026 (Qi Han Wong, arXiv:2606.03641) tests this assumption directly — in medical triage, using Gemini, Claude, and GPT.
The result challenges the assumption significantly.
The same neurological symptom profile, submitted to all three model families with only the patient’s gender and age changed, produced dramatically different emergency referral rates. In cases described in the paper, the emergency referral rate for a young woman was as low as 6.7%, while the same symptoms in a young man yielded a recommendation rate of 96.7%.
The bias is consistent across all three model families. And it’s not random noise — it has a specific mechanism.
3 Points for Today
- Consistent gender bias across major LLM families: the disparity in emergency referral recommendations for young women versus young men appears in Gemini, Claude, and GPT, not just in one model.
- The mechanism is diagnostic substitution: models assign a lower-urgency diagnosis more prevalent in young women (idiopathic intracranial hypertension) in place of a higher-urgency diagnosis, substituting an epidemiologically probable explanation for a clinically urgent one.
- Bias evaluation needs to be a procurement standard: before deploying LLMs in any clinical support capacity, a same-symptoms, gender-varied test protocol should be a standard quality assurance requirement.
① The Diagnostic Substitution Mechanism
The study used neurological symptom profiles consistent with elevated intracranial pressure — symptoms that, in clinical practice, warrant urgent evaluation and often emergency referral.
The researchers submitted these profiles to the models with gender and age as the varied variables. The outputs were analyzed for emergency referral recommendation rates.
For young women, the models consistently showed a different pattern: rather than recommending emergency evaluation, they tended toward idiopathic intracranial hypertension (IIH) as the likely diagnosis. IIH is a condition that genuinely is more prevalent in young women — so in a statistical sense, this association reflects real epidemiological data.
The problem is what it replaces. By assigning IIH — a lower-urgency condition — the model substitutes a more statistically probable (for this demographic) explanation in place of the more clinically serious one. The term the researcher uses is diagnostic substitution.
The model isn’t making a random error. It’s applying a learned statistical pattern — young women have higher rates of IIH — in a way that overrides the urgency signal present in the actual symptom profile.
The age-specific pattern is telling: the bias disappears at age 65 and above. This age specificity is consistent with IIH’s actual epidemiology, and strongly suggests that training data’s epidemiological statistics are shaping clinical urgency assessments in ways that aren’t always appropriate.
② Why This Isn’t a Single-Model Problem
The natural first response: “this is a fine-tuning issue, correctable at the model level.”
But the paper documents this pattern across three major model families, tested separately. This isn’t an idiosyncratic artifact of one model’s training data. It reflects something about how large language models absorb and apply medical knowledge in general.
The structural issue: LLMs learn associations between demographic features and diagnosis frequencies from training data that reflects actual epidemiology. When asked to triage, they apply those learned associations — appropriately in some senses, but in a way that can suppress urgency signals when the demographic pattern points toward a lower-severity differential.
The practical risk in deployment:
AI-assisted intake tools that under-evaluate symptom urgency for specific demographic groups. Triage support systems that suppress emergency referral recommendations for patients whose profile matches a lower-urgency epidemiological pattern. Diagnostic support tools that systematically favor lower-acuity diagnoses for certain populations.
Even when framed as “advisory,” these outputs influence clinical decisions — especially in high-volume settings where “the AI flagged it as lower urgency” becomes a filter on human attention.
③ What Healthcare AI Decision-Makers Can Do
The test protocol in this research is operationally simple: submit the same symptom profile with demographic variables changed, and compare outputs. This is directly translatable into a procurement and deployment QA process.
Add gender-and-age bias testing to medical AI procurement criteria
For any LLM-based clinical support tool — intake assistance, triage support, diagnostic suggestion — add a “same-symptoms, varied-demographics” test as a standard evaluation requirement before procurement approval.
This is owned jointly by the hospital CIO, medical AI deployment lead, and clinical quality assurance function. The measurable standard: define an acceptable threshold for emergency referral rate disparity across gender and age groups, and require vendor demonstration that the tool meets it.
This is a QA gate, not a post-hoc audit. Building it into procurement means the standard is set before deployment, not discovered after adverse events.
Retrospective bias audit for deployed tools
For AI tools already in use, apply the same logic retrospectively: stratify historical AI-assisted triage or diagnostic recommendation logs by gender and age, and check for statistically significant disparities in urgency classifications.
This requires coordination between clinical informatics and the clinical quality function. A quarterly or annual bias audit cycle, with defined escalation criteria for detected disparities, is a reasonable ongoing governance structure.
For healthcare AI startups: transparency as differentiation
If you’re developing or selling AI tools in clinical or wellness contexts, building this bias evaluation into your development and testing pipeline is both a regulatory risk management step and a market differentiation opportunity.
Regulatory bodies — PMDA in Japan, FDA in the US — are increasing their attention to AI fairness as a component of medical device safety evaluation. Being able to demonstrate that your product passed standardized demographic bias testing is becoming a distinguishing feature, not just a compliance checkbox.
The “AI Is Neutral” Assumption Needs to Go
Healthcare has always understood that medical devices and clinical decision support tools require rigorous safety evaluation. AI-based clinical tools deserve the same framework.
The specific lesson from this paper: statistical associations in training data can introduce systematic bias into AI medical recommendations in ways that are not visible from model accuracy benchmarks alone. A model can achieve high overall accuracy while producing systematically different recommendations for different patient groups in clinically consequential ways.
The response isn’t to avoid AI in clinical settings. It’s to apply the same quality assurance rigor to demographic equity in model outputs that clinical institutions apply to every other performance dimension.
The test protocol exists. The question is whether it becomes standard.
That’s all for today!
References
- Qi Han Wong (2026). Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency. arXiv preprint.
* This article was written in part with AI assistance and may contain inaccuracies.