Study Shows AI Models Struggle with Slight Variations in Medical Tasks

Despite excelling in structured testing, AI models fail dramatically when faced with rephrased medical questions, casting doubt on clinical dependability and raising concerns about premature deployment in healthcare settings.

PMTLY Editorial Team | Aug 23, 2025 | Source: PsyPost

Key Research Findings

  • AI models falter dramatically with rephrased medical questions - up to 40% accuracy drop
  • Models rely on pattern recognition rather than genuine medical reasoning
  • Raises serious concerns about reliability in healthcare deployment
  • Study warns against premature clinical deployment without better evaluation
  • Researchers call for more robust evaluation frameworks for medical AI

AI Medical Reasoning Under Scrutiny

A groundbreaking study published in JAMA Network Open reveals critical limitations in AI medical reasoning, showing that leading AI models experience dramatic performance drops when medical questions are slightly rephrased. The research, conducted by Stanford University scientists, challenges the assumption that high test scores translate to reliable clinical reasoning.

"These AI models aren't as reliable as their test scores suggest. When we changed the answer choices slightly, performance dropped dramatically, with some models going from 80% accuracy down to 42%."

— Suhana Bedi, PhD Student, Stanford University

The findings suggest that current AI systems may rely heavily on pattern recognition rather than genuine medical reasoning, raising significant concerns about their readiness for clinical deployment. When familiar answer patterns were altered, even the most advanced models showed substantial performance degradation.

Research Motivation

The study addresses a critical gap between AI performance on standardized tests and real-world clinical reliability. While AI models achieve impressive scores on medical licensing exams, researchers found that less than 5% of papers evaluate these systems on actual patient data, which can be messy and fragmented.

Revolutionary Testing Methodology

Researchers developed a novel evaluation approach using the MedQA benchmark, selecting 100 multiple-choice medical questions and modifying 68 of them to replace correct answers with "None of the other answers" (NOTA). This subtle change forced models to rely on actual reasoning rather than recognizing familiar patterns.
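To make the modification concrete, here is a minimal sketch of the NOTA substitution under stated assumptions: the dictionary layout and field names below are hypothetical, since the article only describes MedQA items as multiple-choice questions with a single correct answer.

```python
# Minimal sketch of the NOTA substitution (hypothetical item format; the
# study's actual data structures are not described in the article).
import copy

def make_nota_variant(item: dict, nota_text: str = "None of the other answers") -> dict:
    """Overwrite the correct option with a NOTA choice, which becomes the new answer."""
    variant = copy.deepcopy(item)
    correct_label = variant["answer"]              # e.g. "C"
    variant["options"][correct_label] = nota_text  # the original correct text is removed
    return variant                                 # NOTA is now the only correct pick

original = {
    "question": "A newborn has a flexible, inward-turning forefoot. Next step in management?",
    "options": {"A": "Serial casting", "B": "Surgical referral",
                "C": "Reassurance", "D": "Physical therapy"},
    "answer": "C",
}
modified = make_nota_variant(original)
print(modified["options"]["C"])  # -> "None of the other answers"
```

Because only the answer text changes, any accuracy lost on the modified items points to reliance on the familiar option rather than to harder clinical content.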

Study Design Elements

  • Test Set Creation: 68 modified medical questions with NOTA replacements, each clinician-reviewed for medical appropriateness
  • Models Tested: Six leading AI systems, including GPT-4o and Claude 3.5 Sonnet, all prompted with chain-of-thought reasoning (an illustrative prompt sketch follows this list)
  • Evaluation Method: Comparative analysis of original vs. modified questions, with statistical significance testing for accuracy drops (a scoring sketch follows the clinical example below)
  • Clinical Relevance: Real diagnostic scenarios requiring step-by-step reasoning, focused on treatment and diagnosis decisions
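The article notes that all models were prompted with chain-of-thought reasoning but does not reproduce the prompt itself, so the template below is only an illustrative guess at what such a prompt might look like.

```python
# Illustrative chain-of-thought prompt builder; the wording is an assumption,
# not the study's actual prompt.

def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    """Format a multiple-choice item with an explicit step-by-step instruction."""
    option_block = "\n".join(f"{label}. {text}" for label, text in sorted(options.items()))
    return (
        "You are answering a medical exam question.\n\n"
        f"Question: {question}\n\n"
        f"Options:\n{option_block}\n\n"
        "Reason through the clinical scenario step by step, then give the single "
        "best answer as one option letter on the final line."
    )

prompt = build_cot_prompt(
    "A newborn has a flexible, inward-turning forefoot. Next step in management?",
    {"A": "Serial casting", "B": "Surgical referral",
     "C": "None of the other answers", "D": "Physical therapy"},
)
print(prompt)
```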

One example involved a newborn with an inward-turning foot (metatarsus adductus), where "Reassurance" was the correct original answer. In the modified version, this option was removed and replaced with "None of the other answers," testing whether models could reason through the clinical scenario independently.
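To show how the "original vs. modified" comparison can be scored, here is a sketch that computes one model's accuracy on both versions of the paired items and tests the drop for significance. McNemar's test (via statsmodels, assumed installed) is a reasonable choice for paired correct/incorrect outcomes, but the article does not say which test the authors actually used, and the scores below are invented.

```python
# Sketch of the paired original-vs-NOTA comparison. McNemar's test and the
# example scores are illustrative choices; the article specifies neither.
from statsmodels.stats.contingency_tables import mcnemar

def accuracy_drop(orig_correct: list[bool], nota_correct: list[bool]):
    """Return (original accuracy, NOTA accuracy, McNemar p-value) for paired item scores."""
    n = len(orig_correct)
    acc_orig = sum(orig_correct) / n
    acc_nota = sum(nota_correct) / n
    table = [[0, 0], [0, 0]]                     # rows: original wrong?, cols: NOTA wrong?
    for o, m in zip(orig_correct, nota_correct):
        table[int(not o)][int(not m)] += 1
    p_value = mcnemar(table, exact=True).pvalue  # exact test on the discordant pairs
    return acc_orig, acc_nota, p_value

# Invented scores for one model on 68 paired items (not the study's data)
orig = [True] * 55 + [False] * 13
nota = [True] * 30 + [False] * 38
acc_o, acc_n, p = accuracy_drop(orig, nota)
print(f"original {acc_o:.0%} -> NOTA {acc_n:.0%} (p = {p:.3g})")
```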

Dramatic Performance Declines Across All Models

The results were striking and consistent: all six AI models experienced significant accuracy drops when faced with the modified questions. The performance degradation ranged from concerning to catastrophic across different systems.

Performance Impact by Model

  • Most Resilient: DeepSeek-R1 and o3-mini, with 9-16% accuracy drops - still concerning, but relatively better performance
  • Significant Decline: GPT-4o, with over a 25% accuracy reduction - a widely used model showing major reliability issues
  • Severe Performance Loss: Claude 3.5 Sonnet, with a 33% accuracy drop - a popular model showing critical reasoning limitations
  • Catastrophic Failure: Llama 3.3-70B, with nearly 40% more incorrect answers - the largest performance degradation observed

"What surprised us most was the consistency of the performance decline across all models, including the most advanced reasoning models like DeepSeek-R1 and o3-mini."

— Suhana Bedi on universal performance degradation

The consistency of performance decline across all tested models suggests a fundamental limitation in current AI reasoning capabilities rather than isolated weaknesses in specific systems. Even models designed for enhanced reasoning failed to maintain reliability when patterns were altered.

Critical Implications for Healthcare

The findings have profound implications for AI deployment in clinical settings. If AI systems cannot handle minor variations in question formatting, they may struggle with the inherent variability of real-world medical practice, where patients present with overlapping symptoms, incomplete histories, and unexpected complications.

Healthcare Deployment Concerns

  • Pattern Dependence: Models rely on recognizing familiar formats rather than reasoning
  • Novel Situations: Poor performance when faced with unexpected clinical presentations
  • Reliability Questions: Inconsistent performance undermines clinical trust
  • Safety Implications: Potential for incorrect diagnoses or treatment recommendations

Educational Analogy

Researchers compare the phenomenon to "having a student who aces practice tests but fails when questions are worded differently." This highlights the difference between memorizing patterns and developing genuine understanding - a crucial distinction in medical decision-making.

"It's like having a student who aces practice tests but fails when the questions are worded differently. For now, AI should help doctors, not replace them."

— Suhana Bedi on AI limitations

Study Limitations and Research Priorities

While groundbreaking, the study acknowledges several limitations that provide context for the findings. The research team tested only 68 questions from one medical exam and used a specific methodology for evaluating reasoning capabilities.

"We only tested 68 questions from one medical exam, so this isn't the full picture of AI capabilities. Real clinical deployment would likely involve more sophisticated setups than what we tested."

— Suhana Bedi on study scope

Three Major Research Priorities

The authors identify three critical areas for future development: building evaluation tools that distinguish reasoning from pattern recognition, improving transparency around how systems handle novel medical problems, and developing models that prioritize genuine reasoning abilities over test performance.

Future Research Directions

Additional research should include larger, more diverse datasets and evaluation of alternative approaches such as retrieval-augmented generation or fine-tuning on clinical data. The goal is developing genuinely reliable AI for medical use rather than systems optimized for test performance.

Broader Context and Industry Impact

This study contributes to growing scrutiny of AI reliability in healthcare applications. Previous research has identified issues with AI hallucinations, inconsistent responses, and fabricated medical references, underscoring the need for rigorous evaluation before clinical deployment.

Responsible AI Development

The findings emphasize the critical importance of responsible AI development in healthcare, where errors can have life-threatening consequences. The research advocates for AI systems that assist rather than replace medical professionals, maintaining human oversight in clinical decision-making.

"Medicine is complicated and unpredictable, and we need AI systems that can handle that complexity safely. This research is about making sure we get there responsibly."

— Suhana Bedi on responsible AI development

Article Tags

AI Healthcare, Medical AI, AI Reliability, Research, Clinical AI, Medical Reasoning, AI Safety, Healthcare Technology, JAMA Study, Stanford Research, AI Limitations, Medical Decision Making
