LLMs Show Lower Accuracy Than Clinicians for STEMI Diagnosis From ECGs

Rapid and accurate interpretation of a 12-lead electrocardiogram (ECG) is essential for diagnosing ST-segment elevation myocardial infarction (STEMI) and initiating timely reperfusion therapy. The expanding use of artificial intelligence tools, including large language models (LLMs), has raised interest in their potential clinical applications. However, their diagnostic reliability in high-risk emergency settings remains uncertain.

A case-control study published in Heart & Lung evaluated the diagnostic performance of GPT-5 and GPT-4o in identifying STEMI from ECG images and compared the results with those of emergency medicine specialists (EMSs) and cardiologists. The study included 234 patients: 117 angiography-confirmed STEMI cases and 117 age- and sex-matched controls presenting with chest pain but without STEMI. Anonymized ECG images were presented to three EMSs, three cardiologists, GPT-5, and GPT-4o. Each evaluator answered a dichotomous question: “Is there a STEMI?” The LLMs were queried three times on separate days to assess response consistency.

Cardiologists achieved an accuracy of 89.6% and EMSs 87.8%, both significantly higher than GPT-5 (69.9%) and GPT-4o (55.9%) (p<0.001). GPT-5 demonstrated 85.5% sensitivity, similar to clinicians (86.9%–88.6%), but showed a high false-positive rate of 45.6% compared with cardiologists (7.7%) and EMSs (13.1%). GPT-4o demonstrated lower sensitivity at 76.9%. Consistency analysis showed substantial agreement for GPT-5 (Fleiss’ κ=0.76) but only fair agreement for GPT-4o (κ=0.26).
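The reported figures can be cross-checked with simple arithmetic. Because the study included equal numbers of cases and controls, overall accuracy is the average of sensitivity and specificity, where specificity is one minus the false-positive rate. The sketch below is illustrative only: the function name `accuracy_from_rates` is hypothetical, and it assumes the published sensitivity and false-positive rate for GPT-5 apply directly to the 117/117 split.

```python
def accuracy_from_rates(sensitivity, false_positive_rate,
                        n_cases=117, n_controls=117):
    """Overall accuracy implied by sensitivity and false-positive rate.

    Hypothetical helper for illustration; group sizes default to the
    117 STEMI cases and 117 controls described in the study.
    """
    specificity = 1.0 - false_positive_rate
    # Expected number of correct calls in each group
    correct = sensitivity * n_cases + specificity * n_controls
    return correct / (n_cases + n_controls)


# GPT-5: 85.5% sensitivity and a 45.6% false-positive rate, as reported
gpt5_accuracy = accuracy_from_rates(0.855, 0.456)
print(f"GPT-5 implied accuracy: {gpt5_accuracy:.1%}")
```

Under these assumptions, the implied accuracy is roughly 70%, in line with the reported 69.9%. This shows that GPT-5's accuracy gap relative to clinicians stems from false positives, not from missed STEMIs.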

LLMs showed variable diagnostic performance when interpreting ECG images for STEMI. In this study, current LLMs were not reliable as independent tools for diagnosing STEMI.

Key highlights

In a case-control study of 234 patients, cardiologists and emergency medicine specialists achieved higher diagnostic accuracy for STEMI on ECG images than GPT-5 and GPT-4o.
GPT-5 demonstrated sensitivity comparable to clinicians but produced a markedly higher false-positive rate (45.6%).
GPT-4o showed lower diagnostic accuracy and sensitivity, and current LLMs were not reliable as independent tools for diagnosing STEMI.

Source

Kokulu K, Akay M, Sert ET. Accuracy of GPT-5 and GPT-4o in diagnosing STEMI from 12-Lead ECGs: A comparative study with cardiologists and emergency physicians. Heart Lung. Published online March 7, 2026. doi:10.1016/j.hrtlng.2026.102754
