Session Information
Session Type: Poster Session A
Session Time: 10:30AM-12:30PM
Background/Purpose: Timely access to current rheumatology guidelines at the point of care presents a significant challenge. Large Language Models (LLMs) offer potential solutions, but their propensity for “hallucinations” raises safety concerns. The primary objective of this study was to develop and evaluate a novel Retrieval-Augmented Generation (RAG) system, the first of its kind specifically for adult rheumatology, integrating European Alliance of Associations for Rheumatology (EULAR) and American College of Rheumatology (ACR) guidelines to provide clinicians with timely, evidence-based recommendations.
Methods: Seventy-four clinically relevant EULAR and ACR management guidelines for adult rheumatology were selected and processed. A RAG system was implemented using the LangChain framework, the voyage-3 embedding model, and a Qdrant vector database (Figure 1). For evaluation, 740 guideline-specific questions were generated. Answers were produced by an LLM (ChatGPT-o3-mini) both with context retrieval (RAG) and without it (baseline). Performance was assessed by an LLM-as-a-judge (Gemini 2.0 Flash) using a 5-point Likert scale across five dimensions (relevance, factual accuracy, safety, completeness, conciseness) and by recording which answer the judge preferred. Wilcoxon signed-rank and binomial tests were used for statistical analysis. Two blinded rheumatologists independently validated a random 15% sample of questions.
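The retrieval step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: a toy bag-of-words embedding stands in for voyage-3, an in-memory list stands in for the Qdrant collection, and the names `embed`, `retrieve`, `build_prompt`, `VOCAB`, and `CHUNKS`, along with the placeholder excerpts, are hypothetical. The sketch only shows the general pattern of embedding a question, ranking guideline chunks by cosine similarity, and placing the top chunks in the prompt.

```python
import math

# Hypothetical vocabulary and placeholder chunks; not real guideline text.
VOCAB = ["gout", "urate", "methotrexate", "arthritis", "vaccination"]
CHUNKS = [
    "placeholder excerpt on gout and urate targets",
    "placeholder excerpt on methotrexate in arthritis",
    "placeholder excerpt on vaccination",
]


def embed(text, vocab=VOCAB):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks=CHUNKS, k=2):
    """Return the k chunks most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a prompt that restricts the answer to retrieved excerpts."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the guideline excerpts below.\n"
        f"{context}\n\nQuestion: {question}"
    )


question = "what urate target applies in gout"
top_chunks = retrieve(question, k=1)
prompt = build_prompt(question, top_chunks)
```

In the baseline condition, the same question would be sent to the LLM without the retrieved excerpts, which is the comparison the evaluation is built around.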
Results: The LLM-as-a-judge evaluation showed that the RAG system significantly outperformed the baseline system across all criteria (p < 0.001). The LLM-as-a-judge preferred the RAG system in 92.8% of comparisons (p < 0.001; Table 1). Manual evaluation by rheumatologists confirmed these findings, with significant improvements in accuracy, safety, and completeness for the RAG system (p < 0.001), which was preferred in 71.2%-74.8% of comparisons (p < 0.001; Table 2).
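The two tests named in the Methods can be sketched with the Python standard library. This is an illustration under stated assumptions, not the authors' analysis code. The abstract does not say which variant of each test was used, so the sketch assumes a tie-corrected normal approximation for the Wilcoxon signed-rank test and an exact two-sided binomial test against a null preference rate of 0.5. All scores and counts below are hypothetical and are not study data.

```python
from collections import Counter
from math import comb, erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation, tie-corrected).

    Returns (W_plus, two_sided_p). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        # Tied values share the mean of ranks i+1 .. j.
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    tie_sizes = Counter(abs_sorted).values()
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
    z = (w_plus - mean) / sqrt(var)
    return w_plus, erfc(abs(z) / sqrt(2))


def binomial_test(k, n, p=0.5):
    """Exact two-sided binomial test: sums outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for float comparison
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical 5-point Likert scores for ten paired answers.
rag_scores = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5]
baseline_scores = [3, 4, 4, 2, 3, 3, 4, 2, 4, 3]
w_plus, p_wilcoxon = wilcoxon_signed_rank(rag_scores, baseline_scores)

# Hypothetical preference count: RAG answer preferred in 17 of 20 pairs.
p_preference = binomial_test(17, 20)
```

Because Likert scores produce many tied differences, the tie correction in the variance term matters here. A statistics package such as SciPy would normally be used instead of hand-written functions.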
Conclusion: This study successfully developed and validated a RAG system integrating EULAR and ACR guidelines for adult rheumatology. The system significantly enhances the quality and reliability of LLM-generated answers, providing a robust foundation for AI-driven clinical decision support tools. Such tools have the potential to improve guideline adherence and evidence-based practice in rheumatology by offering clinicians rapid, context-aware access to recommendations.
Figure 1: Walkthrough of the proposed RAG architecture, covering the entire process from initial creation to final evaluation.
Table 1: LLM-as-a-judge evaluation results.
Table 2: Manual evaluation results.
To cite this abstract in AMA style:
Madrid A, Benavent D, Plasencia-Rodríguez C, Rosales-Rosado Z, Merino-Barbancho B, FREITES D. Optimizing the Clinical Application of Rheumatology Guidelines Using Large Language Models: A Retrieval-Augmented Generation Framework Integrating EULAR and ACR Recommendations [abstract]. Arthritis Rheumatol. 2025; 77 (suppl 9). https://acrabstracts.org/abstract/optimizing-the-clinical-application-of-rheumatology-guidelines-using-large-language-models-a-retrieval-augmented-generation-framework-integrating-eular-and-acr-recommendations/. Accessed .
ACR Meeting Abstracts - https://acrabstracts.org/abstract/optimizing-the-clinical-application-of-rheumatology-guidelines-using-large-language-models-a-retrieval-augmented-generation-framework-integrating-eular-and-acr-recommendations/