Abstract Number: 1740

Performance of Large Language Models in Rheumatology Board-Like Questions: Accuracy, Quality, and Safety

Jaime Flores Gouyonnet1, Mariana Gonzalez-Trevino1, Cynthia Crowson1, Ryan Lennon1, Alain Sanchez-Rodriguez2, Gabriel Figueroa-Parra3, Elena Joerns4, Bradly Kimbrough5, Maria Cuellar-Gutierrez1, Erika Navarro-Mendoza1 and Ali Duarte-Garcia1, 1Mayo Clinic, Rochester, MN, 2Mayo Clinic College of Medicine and Science, Rochester, MN, 3Division of Rheumatology, University Hospital "Dr. Jose Eleuterio Gonzalez", Universidad Autonoma de Nuevo Leon, Monterrey, Mexico, 4Mayo Clinic, Rochester, 5Mayo Clinic Rochester, Rochester, MN

Meeting: ACR Convergence 2024

Keywords: autoimmune diseases, Autoinflammatory diseases, education, medical, ethics, Surveys

Session Information

Date: Sunday, November 17, 2024

Title: Abstracts: Professional Education

Session Type: Abstract Session

Session Time: 3:00PM-4:30PM

Background/Purpose: Large language models (LLMs) are becoming an increasingly common source of information for clinicians. We aimed to evaluate the accuracy, quality, and safety of the responses provided by three LLMs to rheumatology questions, including questions with images.

Methods: We tested three LLMs: GPT-4, Claude 3: Opus, and Gemini Advanced. We used 40 multiple-choice questions (10 with images) from the ACR CARE-2022 Question Bank (CQB). Accuracy was defined as the proportion of LLM answers that matched the CQB answer key, which served as the gold standard. Five board-certified international rheumatologists then evaluated, in a blinded fashion (i.e., without knowing which LLM produced each answer), the quality and safety of the LLMs' answers across 7 domains: 1. scientific consensus, 2. evidence of comprehension, 3. evidence of retrieval, 4. evidence of reasoning, 5. inappropriate/incorrect content, 6. missing content, and 7. possibility of harm (Table 1). Domains 1-6 were rated on a 5-point Likert scale. Domain 7 was assessed only for incorrect answers: raters judged whether the answer could cause harm and rated the extent as mild, moderate, or severe. If an LLM refused to answer a question or gave two answers to a single question, the response was scored as incorrect and excluded from the domain analysis. Multinomial logistic regression was used to compare the Likert responses in the first 6 domains between LLMs, with questions and raters modeled as random effects.
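
The scoring rules above can be summarized in a short sketch (Python; the data layout, field names, and helper functions are hypothetical illustrations, not the study's actual code). The multinomial logistic regression with question and rater random effects would require dedicated mixed-model software and is not reproduced here.

```python
# Sketch of the deterministic scoring rules described in the Methods
# (illustrative only; the Response fields and helper names are hypothetical).
from dataclasses import dataclass

@dataclass
class Response:
    model: str           # "GPT-4", "Claude 3: Opus", or "Gemini Advanced"
    question_id: int
    answers: list[str]   # choices the LLM committed to; empty if it refused
    gold: str            # CQB answer key (the gold standard)

def is_correct(r: Response) -> bool:
    # A refusal (no answer) or two answers to one question counts as incorrect.
    return len(r.answers) == 1 and r.answers[0] == r.gold

def eligible_for_domain_rating(r: Response) -> bool:
    # Domain ratings are performed only on single-answer responses;
    # refusals and double answers are excluded from the domain analysis.
    return len(r.answers) == 1

def accuracy(responses: list[Response], model: str) -> float:
    scored = [r for r in responses if r.model == model]
    return sum(is_correct(r) for r in scored) / len(scored)
```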

Results: GPT-4 and Claude 3: Opus answered all the questions; Gemini Advanced refused to answer 11 questions (27.5%). GPT-4 provided two answers for two questions (5%).

GPT-4 answered 78% (31/40) of the questions correctly, Claude 3: Opus 63% (25/40), and Gemini Advanced 53% (21/40). On the 10 questions that included image analysis, GPT-4 and Claude 3: Opus each scored 80% (8/10), while Gemini Advanced scored 30% (3/10).
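
These figures follow directly from the reported counts; as a quick arithmetic check (note that Gemini Advanced's accuracy is computed over all 40 assigned questions, with its 11 refusals scored as incorrect):

```python
# Reported accuracy, recomputed from the counts given in the text.
overall = {"GPT-4": (31, 40), "Claude 3: Opus": (25, 40), "Gemini Advanced": (21, 40)}
images  = {"GPT-4": (8, 10),  "Claude 3: Opus": (8, 10),  "Gemini Advanced": (3, 10)}

for label, counts in (("overall", overall), ("image-based", images)):
    for model, (correct, total) in counts.items():
        print(f"{model} ({label}): {correct}/{total} = {100 * correct / total:.1f}%")
# Overall: 77.5%, 62.5%, 52.5% -- reported rounded to 78%, 63%, 53%.
```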

GPT-4 outperformed Claude 3: Opus in domains 1 (p<0.001), 4 (p<0.001), 5 (p=0.007), and 6 (p=0.011). Gemini Advanced performed worse than GPT-4 in all 6 domains (p<0.001) and worse than Claude 3: Opus in domains 1 (p=0.01), 2 (p<0.001), 3 (p<0.001), 4 (p<0.001), and 6 (p<0.001) (Table 2).

The percentage of incorrectly answered questions rated as having a possibility of harm was similar across all 3 models: Gemini Advanced 75% (6/8), Claude 3: Opus 73% (11/15), and GPT-4 71% (5/7). However, Gemini Advanced had the highest percentage of "severe harm" answers (52%), followed by Claude 3: Opus (42%); GPT-4 had the lowest (16%) (Figure).
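
The denominators are consistent with the scoring rules and counts reported above, since refusals and double answers were excluded from the domain and harm analysis; a short reconciliation (illustrative arithmetic only):

```python
# Reconciling the harm-assessment denominators with the results above.
total = 40

# GPT-4: 40 - 31 correct = 9 incorrect; its 2 double-answer responses were
# excluded from the domain analysis, leaving 7 evaluated incorrect answers.
gpt4_denominator = (total - 31) - 2      # -> 7

# Claude 3: Opus answered every question once: 40 - 25 correct = 15 evaluated.
claude_denominator = total - 25          # -> 15

# Gemini Advanced: 11 refusals were never rated; 40 - 11 - 21 correct = 8.
gemini_denominator = total - 11 - 21     # -> 8

harm = {"GPT-4": (5, gpt4_denominator),
        "Claude 3: Opus": (11, claude_denominator),
        "Gemini Advanced": (6, gemini_denominator)}
for model, (harmful, evaluated) in harm.items():
    print(f"{model}: {harmful}/{evaluated} = {100 * harmful / evaluated:.0f}%")
# -> 71%, 73%, 75%, matching the percentages reported above.
```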

Conclusion: Our study evaluated the accuracy, quality, and safety of responses from three LLMs to rheumatology questions. GPT-4 outperformed the others, achieving the highest accuracy and superior quality scores in multiple domains, with a lower incidence of severe harm. Claude 3: Opus had moderate accuracy. Gemini Advanced showed the lowest accuracy, poorest performance in image analysis, highest refusal rate, and the highest potential for severe harm. Continuous evaluation and improvement of LLMs are crucial for their safe clinical application, especially in complex fields like rheumatology.

Table 1. Domain Description

Table 2. Assessment of domains 1-6 by five rheumatologists across all questions answered by each LLM.

Figure. Percentage of incorrect questions with possibility of harm rated as "mild," "moderate," or "severe."


Disclosures: J. Flores Gouyonnet: None; M. Gonzalez-Trevino: None; C. Crowson: None; R. Lennon: None; A. Sanchez-Rodriguez: None; G. Figueroa-Parra: None; E. Joerns: Pfizer, 5; B. Kimbrough: None; M. Cuellar-Gutierrez: None; E. Navarro-Mendoza: None; A. Duarte-Garcia: None.

To cite this abstract in AMA style:

Flores Gouyonnet J, Gonzalez-Trevino M, Crowson C, Lennon R, Sanchez-Rodriguez A, Figueroa-Parra G, Joerns E, Kimbrough B, Cuellar-Gutierrez M, Navarro-Mendoza E, Duarte-Garcia A. Performance of Large Language Models in Rheumatology Board-Like Questions: Accuracy, Quality, and Safety [abstract]. Arthritis Rheumatol. 2024; 76 (suppl 9). https://acrabstracts.org/abstract/performance-of-large-language-models-in-rheumatology-board-like-questions-accuracy-quality-and-safety/.