Session Information
Date: Sunday, October 26, 2025
Session Type: Poster Session A
Session Time: 10:30AM-12:30PM
Background/Purpose: Low back pain (LBP) is a multifactorial condition managed by various specialists. AI chatbots such as ChatGPT may help clinicians identify probable diagnoses. Given that query phrasing can influence outputs, we hypothesized that ChatGPT’s responses may vary depending on the specialty stated in the prompt. We aimed to evaluate whether ChatGPT’s diagnostic output changes when simulating different specialties in LBP assessment, and to compare its diagnostic accuracy with that of clinicians.
Methods: A total of 10 clinical cases related to LBP were included from official public examinations for rheumatologists in Spain, designed to assess expertise for permanent specialist positions. These comprised 5 cases of rheumatologic diseases and 5 representing other causes. The exercise was conducted in December 2024 using ChatGPT-4o. Ten clinicians with at least 5 years of experience in managing rheumatic and musculoskeletal diseases (RMDs) participated in the study. Each question was answered independently by the clinicians and, at a later stage, by each participant querying ChatGPT while simulating five specialties (Rheumatology, Neurology, Internal Medicine, Rehabilitation, and Orthopedics). The gold standard was the official answer listed as the diagnosis for each exam question. Diagnostic performance was evaluated using precision (percentage of cases in which the top diagnosis matched the gold standard) and sensitivity (percentage of cases in which the gold standard was included in the top three probable diagnoses). The time taken to answer all 10 clinical cases was recorded both for clinicians alone and when using ChatGPT, starting when the case was reviewed and stopping when three differential diagnoses and the most probable diagnosis were finalized.
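For illustration only (this is not the study’s analysis code), the two metrics defined above can be sketched as follows; the case data and function names are hypothetical:

def precision(cases):
    # % of cases where the top-ranked diagnosis equals the gold standard
    hits = sum(1 for gold, ranked in cases if ranked[0] == gold)
    return 100 * hits / len(cases)

def sensitivity(cases):
    # % of cases where the gold standard appears among the top three diagnoses
    hits = sum(1 for gold, ranked in cases if gold in ranked[:3])
    return 100 * hits / len(cases)

# Each case: (gold-standard diagnosis, ranked list of up to three differential diagnoses)
cases = [
    ("axial spondyloarthritis", ["axial spondyloarthritis", "mechanical LBP", "spinal infection"]),
    ("vertebral fracture", ["mechanical LBP", "vertebral fracture", "bone metastasis"]),
]

print(precision(cases))    # 50.0  (top diagnosis correct in 1 of 2 cases)
print(sensitivity(cases))  # 100.0 (gold standard within top three in both cases)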
Results: In total, 528 free-text diagnoses were generated and standardized into 39 diagnostic categories. The percentage of correct diagnoses for each participant and each prompted specialty is illustrated in Figure 1. Median precision ranged from 70% to 80% across the five specialties simulated by ChatGPT, and median sensitivity ranged from 80% to 90%. Statistical analysis revealed no significant differences in precision (p = 0.80) or sensitivity (p = 0.68) between the specialties simulated by ChatGPT, indicating consistent performance regardless of the prompted specialty. For clinicians, median precision was 60% and median sensitivity was 80%. Compared with clinicians, ChatGPT had significantly higher diagnostic precision (median = 75% vs. 60%, p < 0.001) and significantly higher sensitivity (median = 85% vs. 80%, p = 0.02). The mean time taken by participants to complete the task was 12.35 ± 5.62 minutes, compared with 2.33 ± 0.03 minutes for ChatGPT (p < 0.01).
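The abstract does not name the statistical tests used; as a hedged illustration only, nonparametric comparisons of per-participant scores could be run along these lines (all values below are placeholders, not the study data):

from scipy import stats

# Hypothetical per-participant precision scores (%) for each prompted specialty (n = 10)
rheum  = [80, 70, 80, 90, 70, 80, 70, 80, 70, 80]
neuro  = [70, 80, 70, 80, 80, 70, 80, 70, 80, 70]
intmed = [80, 80, 70, 70, 80, 80, 70, 80, 70, 80]
rehab  = [70, 70, 80, 80, 70, 80, 80, 70, 80, 70]
ortho  = [80, 70, 80, 70, 80, 70, 80, 80, 70, 80]

# Comparison across the five simulated specialties (independent groups)
_, p_specialty = stats.kruskal(rheum, neuro, intmed, rehab, ortho)

# ChatGPT vs. clinicians, paired by participant
chatgpt    = [80, 70, 80, 90, 70, 80, 70, 80, 70, 80]
clinicians = [60, 60, 70, 60, 50, 60, 70, 60, 60, 60]
_, p_paired = stats.wilcoxon(chatgpt, clinicians)

print(p_specialty, p_paired)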
Conclusion: ChatGPT provides consistent diagnostic performance across simulated specialties, unaffected by the prompt’s semantic framing. It may outperform clinicians in both diagnostic precision and sensitivity, highlighting its potential as a valuable complementary tool for generating fast, accurate and comprehensive differential diagnoses in cases of low back pain. Further research is needed to explore its application in clinical practice and its ability to enhance diagnostic workflows.
Figure 1. Percentage of correct diagnoses for each participant and each prompted specialty
To cite this abstract in AMA style:
Nack A, Michelena Vegas X, Maymó-Paituvi P, Calomarde-Gómez C, Lobo D, García-Alija A, Ugena-García R, Aparicio M, Vidal Montal P, Benavent D. Evaluating ChatGPT’s Performance in Diagnosing Low Back Pain: A Comparison with Clinicians and Impact of Prompted Specialties [abstract]. Arthritis Rheumatol. 2025; 77 (suppl 9). https://acrabstracts.org/abstract/evaluating-chatgpts-performance-in-diagnosing-low-back-pain-a-comparison-with-clinicians-and-impact-of-prompted-specialties/.