Impact of Large Language Models on Diagnostic Reasoning of Medical Students in Rheumatology: A Randomized Trial

Anna Roemer¹, Nadine Schlicker², Anna Kernder³, Benedikt Albe¹, Juliana Hack⁴, Martin Hirsch⁵, Sebastian Kuhn¹ and Johannes Knitza⁶, ¹Institute for Digital Medicine, University Hospital of Giessen and Marburg, Philipps University Marburg, Marburg, Germany, ²Institute for Artificial Intelligence in Medicine, University Hospital Giessen and Marburg, Marburg, Germany, ³Department of Rheumatology, Rheumazentrum Ruhrgebiet, Herne, Germany, ⁴Center for Orthopaedics and Trauma Surgery, University Hospital Giessen and Marburg, Marburg, Germany, ⁵Institute for Artificial Intelligence in Medicine, University Hospital Giessen and Marburg, Philipps University Marburg, Marburg, Germany, ⁶Institute for Digital Medicine, University Hospital Gießen-Marburg, Philipps University, Marburg, Germany

Meeting: ACR Convergence 2025

Keywords: Education, Health Services Research, practice guidelines, quality of care, Randomized Trial

Session Information

Date: Sunday, October 26, 2025

Title: Abstracts: Professional Education (0825–0830)

Session Type: Abstract Session

Session Time: 3:15PM-3:30PM

Background/Purpose: Although not certified as medical devices, Large Language Models (LLMs) such as ChatGPT-4 provide rapid support in diagnostic reasoning and may facilitate scalable upskilling of healthcare professionals and laypersons alike. This trial evaluated the impact of LLM assistance on the diagnostic reasoning of medical students solving rheumatology case vignettes.

Methods: In this RCT (NCT06748170), 68 medical students were allocated between January 7 and March 30, 2025, to either an intervention group (IG), with access to ChatGPT-4o alongside conventional diagnostic resources, or a control group (CG), with access to conventional resources only. Participants in the IG first generated diagnostic suggestions using conventional resources and then revised their suggestions after consulting the LLM. All participants completed 3 rheumatology case vignettes (granulomatosis with polyangiitis [GPA], rheumatoid arthritis [RA] and systemic lupus erythematosus [SLE]) originally published in the ACR’s online learning center. Each vignette required a top diagnosis and allowed up to five diagnostic suggestions in total. The suggested diagnoses were reviewed independently and in a blinded manner by two board-certified rheumatologists. Diagnostic accuracy was additionally scored by awarding two points for correct diagnoses and one point for plausible alternatives, generating a cumulative diagnostic score. Time to case completion and diagnostic confidence (0–10) were also recorded.
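The scoring rule described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the rating labels ("correct", "plausible", "incorrect"), the function names, and the assumption that each of the up to five suggestions is rated and summed independently are hypothetical, since the abstract does not specify these details.

```python
from typing import Sequence

# Points per reviewer rating: two for a correct diagnosis, one for a
# plausible alternative (from the Methods). The label strings themselves
# are hypothetical names chosen for this sketch.
POINTS = {"correct": 2, "plausible": 1, "incorrect": 0}
MAX_SUGGESTIONS = 5  # a top diagnosis plus optional further suggestions


def diagnostic_score(ratings: Sequence[str]) -> int:
    """Sum the points for one vignette; ratings[0] is the top diagnosis."""
    if not 1 <= len(ratings) <= MAX_SUGGESTIONS:
        raise ValueError("expected between 1 and 5 rated suggestions")
    return sum(POINTS[rating] for rating in ratings)


def top_diagnosis_correct(ratings: Sequence[str]) -> bool:
    """True if the first-ranked suggestion was rated correct."""
    return ratings[0] == "correct"


def correct_in_top_five(ratings: Sequence[str]) -> bool:
    """True if any of the listed suggestions was rated correct."""
    return "correct" in ratings
```

Under these assumptions, a list with one correct diagnosis and four plausible alternatives scores 2 + 4 × 1 = 6 points.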

Results: The mean (SD) age was 24.8 (2.6) years; 62% (42/68) of participants were female. Prior use of LLMs was reported by 96% (65/68) of students. Interrater agreement was substantial (Cohen’s κ = 0.79). Students in the IG identified the correct top diagnosis significantly more often than those in the CG (77.5%, mean 2.3/3 [SD 0.8] vs. 32.4%, mean 1.0/3 [SD 0.8]; independent t-test, t = 7.3, P < 0.001) and were also more likely to include a correct diagnosis among their top five suggestions (91.2%, mean 2.7/3 [SD 0.5] vs. 47.1%, mean 1.4/3 [SD 0.7]; independent t-test, t = 8.5, P < 0.001; Figure 1). The standalone performance of the LLM exceeded that of students using conventional resources, listing the correct diagnosis first in 71.6% of cases and within the top five in 72.5%. Median diagnostic scores per case were 4 (IQR, 3–5) in the IG, 2 (IQR, 1–3) in the CG, and 5 (IQR, 3.3–6) for the LLM. Median time to case completion was 498 seconds (IQR, 371–609) in the IG and 253 seconds (IQR, 175–395) in the CG. LLM use significantly increased the proportion of correct top diagnoses in the IG, from 46.1% (mean 1.4/3 [SD 0.7]) to 77.5% (mean 2.3/3 [SD 0.8]; paired t-test, t = 7.1, P < 0.001; Figure 2). Diagnostic confidence in the IG also increased significantly following LLM use (mean 5.2/10 [SD 1.5] to 7.0/10 [SD 1.3]; paired t-test, t = –9.4, P < 0.001). Among IG participants, 97% (33/34) reported they would use the LLM again for diagnostic support, and 91% (31/34) found it easy to use.
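The reported percentages and per-student means can be cross-checked against each other with a short calculation. This is a reader's sanity check, not part of the authors' analysis. It assumes 34 students per group: the IG size follows from the "33/34" figure above, while the equal CG size is inferred from the total of 68 and is not stated in the abstract. The function names are illustrative.

```python
CASES_PER_STUDENT = 3   # three vignettes per participant (from the Methods)
N_PER_GROUP = 34        # IG size from "33/34"; equal CG size is an assumption


def mean_correct_per_student(pct_correct: float) -> float:
    """Convert a percentage of correct cases into a mean out of 3, to 1 decimal."""
    return round(pct_correct / 100 * CASES_PER_STUDENT, 1)


def implied_case_count(pct_correct: float) -> int:
    """Nearest whole number of correct cases out of 34 x 3 = 102."""
    total_cases = N_PER_GROUP * CASES_PER_STUDENT
    return round(pct_correct / 100 * total_cases)


# Percentages reported in the Results, keyed by a short description.
reported = {
    "IG, top diagnosis (after LLM)": 77.5,
    "CG, top diagnosis": 32.4,
    "IG, top five": 91.2,
    "CG, top five": 47.1,
    "IG, top diagnosis (before LLM)": 46.1,
}

for label, pct in reported.items():
    print(label, mean_correct_per_student(pct), implied_case_count(pct))
```

Under the stated assumption, each percentage corresponds to a whole number of cases out of 102 and rounds to the reported mean out of 3 (for example, 77.5% corresponds to 79 of 102 cases and a mean of 2.3).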

Conclusion: To our knowledge, this is the first RCT to evaluate the diagnostic assistive potential of LLMs in rheumatology. Providing medical students with access to an LLM significantly improved diagnostic accuracy compared to conventional resources alone. Further research is warranted to determine how LLMs can be implemented to safely empower healthcare professionals as well as patients.

Figure 1. Proportion of correct diagnoses ranked first or within the top five suggestions

Figure 2. Distribution of top diagnosis categories in the intervention group across study phases

Disclosures: A. Roemer: None; N. Schlicker: None; A. Kernder: None; B. Albe: None; J. Hack: None; M. Hirsch: None; S. Kuhn: None; J. Knitza: GAIA, 2, Vila Health, 12,, 2.

To cite this abstract in AMA style:

Roemer A, Schlicker N, Kernder A, Albe B, Hack J, Hirsch M, Kuhn S, Knitza J. Impact of Large Language Models on Diagnostic Reasoning of Medical Students in Rheumatology: A Randomized Trial [abstract]. Arthritis Rheumatol. 2025; 77 (suppl 9). https://acrabstracts.org/abstract/impact-of-large-language-models-on-diagnostic-reasoning-of-medical-students-in-rheumatology-a-randomized-trial/. Accessed .

