Tehran University of Medical Sciences

Science Communicator Platform

Diagnostic Performance of ChatGPT in Tibial Plateau Fracture in Knee X-Ray

Summary: Can AI diagnose fractures? Study finds ChatGPT-4o matches physicians in diagnosing tibial plateau fractures. #AIinMedicine #TibialFractures

Authors: Mohammadi M1; Parviz S2; Parvaz P3; Pirmoradi MM1; Afzalimoghaddam M1,4; Mirfazaelian H4

Source: Emergency Radiology. Published: 2025


Abstract

Purpose: Tibial plateau fractures are relatively common and require accurate diagnosis. Chat Generative Pre-Trained Transformer (ChatGPT) has emerged as a tool to support medical diagnosis. This study investigates the accuracy of this tool in diagnosing tibial plateau fractures.

Methods: A secondary analysis was performed on 111 knee radiographs from emergency department patients, 29 of which had fractures confirmed by computed tomography (CT). The X-rays were reviewed by a board-certified emergency physician (EP) and a radiologist and then analyzed by ChatGPT-4 and ChatGPT-4o. Diagnostic performances were compared using the area under the receiver operating characteristic curve (AUC). Sensitivity, specificity, and likelihood ratios were also calculated.

Results: Sensitivity and negative likelihood ratio were 58.6% (95% CI: 38.9–76.4%) and 0.4 (95% CI: 0.3–0.7) for the EP, 72.4% (95% CI: 52.7–87.2%) and 0.3 (95% CI: 0.2–0.6) for the radiologist, 27.5% (95% CI: 12.7–47.2%) and 0.7 (95% CI: 0.6–0.9) for ChatGPT-4, and 55.1% (95% CI: 35.6–73.5%) and 0.4 (95% CI: 0.3–0.7) for ChatGPT-4o. Specificity and positive likelihood ratio were 85.3% (95% CI: 75.8–92.2%) and 4.0 (95% CI: 2.1–7.3) for the EP, 76.8% (95% CI: 66.2–85.4%) and 3.1 (95% CI: 1.9–4.9) for the radiologist, 95.1% (95% CI: 87.9–98.6%) and 5.6 (95% CI: 1.8–17.3) for ChatGPT-4, and 93.9% (95% CI: 86.3–97.9%) and 9.0 (95% CI: 3.6–22.4) for ChatGPT-4o. The AUC was 0.72 (95% CI: 0.6–0.8) for the EP, 0.75 (95% CI: 0.6–0.8) for the radiologist, 0.61 (95% CI: 0.4–0.7) for ChatGPT-4, and 0.74 (95% CI: 0.6–0.8) for ChatGPT-4o. The EP and the radiologist significantly outperformed ChatGPT-4 (P = 0.02 and 0.01, respectively), whereas there was no significant difference among the EP, the radiologist, and ChatGPT-4o.

Conclusion: ChatGPT-4o matched the physicians' performance and had the highest specificity. Like the physicians, however, the ChatGPT models were not suitable for ruling out fractures. © The Author(s), under exclusive licence to American Society of Emergency Radiology (ASER) 2024.
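For readers unfamiliar with likelihood ratios: they follow directly from sensitivity and specificity via LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The minimal Python sketch below (illustrative only, not the authors' analysis code) applies these standard definitions to the point estimates reported above for ChatGPT-4o; because the published inputs are rounded, the recomputed LR− can differ from the published value in the last digit.

def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a binary diagnostic test.

    LR+ = sensitivity / (1 - specificity): how much a positive call
    raises the odds of fracture.
    LR- = (1 - sensitivity) / specificity: how much a negative call
    lowers them.
    """
    return sensitivity / (1.0 - specificity), (1.0 - sensitivity) / specificity

# Point estimates for ChatGPT-4o as reported in the abstract.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.551, specificity=0.939)
print(f"LR+ = {lr_pos:.1f}")   # ~9.0, matching the reported value
print(f"LR- = {lr_neg:.2f}")   # ~0.48; rounding of the inputs explains the drift from the published 0.4

An LR− of roughly 0.4–0.5 only modestly lowers the post-test odds of fracture, which is why the abstract concludes that neither the physicians nor the ChatGPT models were suitable for ruling fractures out.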