A recent study by researchers at Harvard Medical School, in collaboration with Beth Israel Deaconess Medical Center, explores how large language models perform in clinical decision-making, including real emergency room scenarios. The findings show that, under certain conditions, AI models can match or even outperform human physicians in diagnostic accuracy, particularly in early-stage assessments.

The research evaluated 76 emergency room cases, comparing diagnoses from two internal medicine attending physicians with those generated by OpenAI o1 and OpenAI GPT-4o. Independent reviewers, unaware of the diagnosis source, assessed accuracy. Results showed that the o1 model consistently performed at least on par with physicians and delivered stronger results during initial triage, where limited patient data and time pressure make diagnosis more challenging.

Notably, the study used unprocessed clinical data. AI models received the same information available in electronic medical records at each diagnostic point, without additional structuring or optimization. In triage scenarios, the o1 model achieved exact or near-exact diagnoses in 67% of cases, compared to 55% and 50% for the two physicians. Researchers highlighted this as a signal that AI systems can extract meaningful patterns quickly, even in high-uncertainty contexts.
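To put the triage percentages in context, the short sketch below converts them into approximate case counts and 95% Wilson score intervals. It is an illustration only, not the study's own analysis: it assumes all 76 cases were scored at triage (the article does not state the triage denominator), and the names `N_CASES`, `reported`, `wilson_interval`, and `summarize` are hypothetical.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumption: every one of the 76 cases was scored at triage.
N_CASES = 76

# Percentages as reported in the article; the labels are placeholders.
reported = {"o1": 0.67, "physician_a": 0.55, "physician_b": 0.50}

def summarize(rates: dict[str, float], n: int) -> dict[str, tuple[int, float, float]]:
    """Map each rater to (approximate correct count, interval low, interval high)."""
    rows = {}
    for name, rate in rates.items():
        k = round(rate * n)  # approximate count implied by the rounded percentage
        low, high = wilson_interval(k, n)
        rows[name] = (k, low, high)
    return rows

summary = summarize(reported, N_CASES)
for name, (k, low, high) in summary.items():
    print(f"{name}: ~{k}/{N_CASES} cases, 95% interval {low:.2f} to {high:.2f}")
```

With only 76 cases, each interval is wide, so the reported gaps are best read as a signal rather than a precise estimate. The sketch also treats each rater independently, whereas the study compared raters on the same cases.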

Despite these results, the study does not position AI as a replacement for clinicians. Instead, it points to the need for real-world trials to better understand how such systems can be integrated into patient care. Current limitations remain, especially in handling non-text inputs such as imaging or physical examination data, where human expertise continues to play a critical role.

The research also raises questions around accountability and clinical responsibility. There is currently no standardized framework governing AI-assisted diagnoses, and patient trust still relies heavily on human judgment, particularly in high-stakes decisions. Some experts also caution that the comparison baseline matters: evaluating AI against internal medicine physicians rather than emergency medicine specialists may influence how its performance is perceived.

From an operational perspective, the study reinforces a broader trend: AI systems are becoming increasingly capable in structured reasoning tasks, especially where data is available in consistent formats. For healthcare organizations and technology teams, the opportunity lies in augmenting clinical workflows, not replacing them, with systems that can support faster, more informed decision-making under pressure.
