Rapid and accurate interpretation of a 12-lead electrocardiogram (ECG) is essential for diagnosing ST-segment elevation myocardial infarction (STEMI) and initiating timely reperfusion therapy. The expanding use of artificial intelligence tools, including large language models (LLMs), has raised interest in their potential clinical applications. However, their diagnostic reliability in high-risk emergency settings remains uncertain.
A case-control study published in Heart & Lung evaluated the diagnostic performance of GPT-5 and GPT-4o in identifying STEMI from ECG images and compared their results with those of emergency medicine specialists (EMSs) and cardiologists. The study included 234 patients: 117 angiography-confirmed STEMI cases and 117 age- and sex-matched controls presenting with chest pain but without STEMI. Anonymized ECG images were presented to three EMSs, three cardiologists, GPT-5, and GPT-4o. Each evaluator answered a dichotomous question: “Is there a STEMI?” The LLMs were queried three times on separate days to assess the consistency of their responses.
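Repeated yes/no answers of this kind are commonly summarized with a chance-corrected agreement statistic such as Fleiss' kappa, treating each of the three query runs as a separate rater. The following is a minimal Python sketch under that assumption; the function name `fleiss_kappa` and the toy answers are hypothetical illustrations, not the study's code or data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated the same number of times.

    ratings: list of lists of category labels, e.g. [["yes", "yes", "no"], ...].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    # counts[i][j] = how many ratings of item i fell into category j
    counts = [[item.count(c) for c in categories] for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall category proportions
    total = n_items * n_raters
    proportions = [sum(row[j] for row in counts) / total
                   for j in range(len(categories))]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: three runs ("raters") over four ECGs
toy_runs = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
]
kappa = fleiss_kappa(toy_runs)
```

Kappa equals 1 when all runs agree on every ECG and falls toward 0 (or below) as agreement approaches what chance alone would produce.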
Cardiologists achieved an accuracy of 89.6% and EMSs 87.8%, both significantly higher than GPT-5 (69.9%) and GPT-4o (55.9%) (p<0.001). GPT-5's sensitivity of 85.5% was similar to that of clinicians (86.9%–88.6%), but its false-positive rate of 45.6% was far higher than that of cardiologists (7.7%) and EMSs (13.1%). GPT-4o had a lower sensitivity of 76.9%. Consistency analysis across the repeated queries showed substantial agreement for GPT-5 (Fleiss’ κ=0.76) but only fair agreement for GPT-4o (κ=0.26).
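Because the study used equal numbers of cases and controls, the reported accuracy can be related directly to sensitivity and false-positive rate. The short Python sketch below illustrates that arithmetic for the GPT-5 figures; it assumes each accuracy value was computed over the full balanced set of 117 cases and 117 controls, and the function name is a hypothetical helper, not part of the study.

```python
def accuracy_from_rates(sensitivity, false_positive_rate, n_cases, n_controls):
    """Overall accuracy implied by sensitivity and false-positive rate."""
    specificity = 1.0 - false_positive_rate
    # Correct calls = true positives among cases + true negatives among controls
    correct = sensitivity * n_cases + specificity * n_controls
    return correct / (n_cases + n_controls)


# GPT-5 figures as reported: 85.5% sensitivity, 45.6% false-positive rate
gpt5_accuracy = accuracy_from_rates(0.855, 0.456, n_cases=117, n_controls=117)
print(f"Implied GPT-5 accuracy: {gpt5_accuracy:.3f}")
```

With balanced groups this reduces to the mean of sensitivity and specificity, which is why a sensitivity close to that of clinicians still yields a much lower accuracy when nearly half of the controls are flagged as STEMI. The implied value agrees with the reported 69.9% to within rounding of the inputs.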
The two LLMs performed inconsistently when interpreting ECG images for STEMI, and both were significantly less accurate than clinicians. Current LLMs were not reliable as independent tools for diagnosing STEMI.