New research reveals that leading AI chatbots have an 80% failure rate in early-stage medical diagnosis, as the models struggle to process incomplete patient data compared to final-stage results.
Leading AI chatbots, including models from OpenAI, Google, and DeepSeek, fail to provide accurate medical diagnoses in over 80% of early-stage cases, according to a study published in *JAMA Network Open*. The research indicates that while large language models (LLMs) are highly proficient at identifying a condition once all data is present, they struggle significantly during the "differential diagnosis" phase: the uncertain beginning of a case, when patient information is often vague or incomplete.
The study, led by researchers at Mass General Brigham, tested 21 different LLMs using clinical vignettes. When faced with limited information, the models frequently narrowed their focus too quickly to a single, often incorrect, answer rather than suggesting a range of possibilities. Once full laboratory results and physical exam findings were provided, however, the failure rate dropped below 40%, and the top-performing models exceeded 90% accuracy in final-stage scenarios.
Tech companies, including Anthropic and Google, have responded by noting that their consumer chatbots are designed to direct users to medical professionals and include reminders to verify AI-generated health information. While experts suggest AI may eventually assist in regions with limited access to doctors, they warn that the current technology cannot replicate the "look and feel" of a hands-on clinical assessment. For now, the study underscores the significant risks of relying on consumer AI as a primary tool for medical diagnosis.
