Session Information
Date: Friday, March 20, 2026
Title: Abstracts: Technology
Session Time: 4:49PM-4:54PM
Background/Purpose: Accumulating evidence supports a new, biology-driven classification system for JIA diagnosis and prognosis that depends on identifying patterns of arthritis involvement rather than the total active joint count alone. Abstraction of active joints for clinical research has traditionally relied on labor-intensive manual chart review and can require significant domain expertise. As a result, even some large clinical observational study efforts, such as the Childhood Arthritis and Rheumatology Research Alliance (CARRA) Registry, routinely collect only the total active joint count rather than the specific joints involved. Recent advances in Large Language Models (LLMs) offer a transformative opportunity to automate this process and reduce its cost.
Methods: We supplied the LLM with a structured context built around an 83-joint list for classifying arthritis activity; the context incorporated note section prioritization, edge-case handling, and temporal constraints. Output was structured to include joint identity, classification confidence (low, intermediate, or high), and reasoning notes. We used GPT-5 (OpenAI, San Francisco, CA) within an internal, HIPAA-compliant instance at Boston Children’s Hospital (BCH). We randomly selected ambulatory notes for patients with JIA from 2017 to 2024. The LLM response was compared against a gold standard derived from expert annotation by a pediatric rheumatologist.
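The structured output described above (joint identity, confidence level, reasoning notes) could be represented and validated roughly as follows. This is a minimal sketch, not the authors' implementation: the JSON layout, the names `JointFinding` and `parse_findings`, and the four-joint `JOINT_LIST` are hypothetical stand-ins, since the abstract does not give the prompt, the output schema, or the full 83-joint list.

```python
import json
from dataclasses import dataclass

# Confidence levels named in the Methods.
CONFIDENCE_LEVELS = ("low", "intermediate", "high")

# Hypothetical subset for illustration; the real 83-joint list is not given in the abstract.
JOINT_LIST = {"right_knee", "left_knee", "right_wrist", "left_ankle"}


@dataclass(frozen=True)
class JointFinding:
    joint: str        # must be a member of the joint list
    confidence: str   # one of CONFIDENCE_LEVELS
    reasoning: str    # free-text explanation returned by the model


def parse_findings(raw_json: str) -> list[JointFinding]:
    """Parse a JSON array of findings and reject anything outside the schema."""
    findings = []
    for item in json.loads(raw_json):
        finding = JointFinding(item["joint"], item["confidence"], item["reasoning"])
        if finding.joint not in JOINT_LIST:
            raise ValueError(f"unknown joint: {finding.joint}")
        if finding.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {finding.confidence}")
        findings.append(finding)
    return findings


# Example model response (invented for illustration).
raw = '[{"joint": "right_knee", "confidence": "high", "reasoning": "effusion on exam"}]'
findings = parse_findings(raw)
```

Rejecting joints outside the fixed list keeps the model's answers comparable, joint by joint, with an expert annotation made against the same list.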
Results: Forty-one outpatient clinical notes from patients fulfilling ILAR criteria for JIA were identified and classified. The median active joint count was 1 (range, 0 to 18). The total active joint count was correct in 34 of 41 cases (83% accuracy). Across individual joints in all cases, the LLM correctly identified 118 affected joints (recall = 87.4%, precision = 85.5%); 20 joints were incorrectly identified as affected (false positive rate = 0.6%, specificity = 99.4%), and 17 joints were incorrectly identified as unaffected (false negative rate = 12.6%). False positives were predominantly due to ambiguous exam documentation (e.g., ‘all fingers swollen’, which led to all MCPs, IPs, PIPs, and DIPs being counted) and were ranked as low confidence by the LLM. False negatives were almost entirely due to contradictory information between the physical exam and assessment documentation.
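The reported joint-level rates can be reproduced from the counts given above. The sketch below assumes that each of the 41 notes was scored on all 83 joints (41 × 83 = 3,403 joint-level decisions), which the abstract does not state explicitly but which matches the reported false positive rate and specificity. The function name `joint_metrics` is illustrative.

```python
def joint_metrics(tp: int, fp: int, fn: int, notes: int, joints_per_note: int) -> dict:
    """Derive joint-level rates from confusion-matrix counts."""
    total = notes * joints_per_note   # assumption: every note is scored on every joint
    tn = total - tp - fp - fn         # true negatives are whatever remains
    return {
        "true_negatives": tn,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts taken from the Results: 118 true positives, 20 false positives, 17 false negatives.
m = joint_metrics(tp=118, fp=20, fn=17, notes=41, joints_per_note=83)
for name in ("recall", "precision", "false_positive_rate",
             "false_negative_rate", "specificity"):
    print(f"{name}: {m[name] * 100:.1f}%")

# Note-level accuracy of the total active joint count: 34 of 41 notes.
count_accuracy = 34 / 41
print(f"count accuracy: {count_accuracy * 100:.0f}%")
```

Under that assumption there are 3,248 true negatives, and the computed values round to the percentages reported in the Results (87.4%, 85.5%, 0.6%, 12.6%, 99.4%, and 83%).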
Conclusion: Using the GPT-5 LLM, natural language processing of clinical notes from pediatric rheumatology visits performed well in identifying active joints and required only modest context-engineering effort from a domain expert. This is consistent with other analysis tasks, in which state-of-the-art generative LLMs often perform well with a ‘zero shot’ (no examples provided) approach. The LLM’s ability to provide confidence levels and explanatory reasoning is likely to greatly reduce research coordinator burden for such tasks, because ambiguous or otherwise low-confidence results can be selectively reviewed manually. We suspect that many other manually intensive chart abstraction activities for observational clinical studies in pediatric rheumatology may be similarly amenable to automation.
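The selective manual review suggested above could be as simple as partitioning findings by their confidence label. The following is a hypothetical sketch: the dictionary layout, the function name `split_for_review`, and the choice to route only "low" results to a coordinator are assumptions, not a workflow described in the abstract.

```python
def split_for_review(findings, review_levels=("low",)):
    """Partition findings into (auto_accepted, needs_manual_review)."""
    auto_accepted, needs_review = [], []
    for finding in findings:
        if finding["confidence"] in review_levels:
            needs_review.append(finding)
        else:
            auto_accepted.append(finding)
    return auto_accepted, needs_review


# Invented example findings for a single note.
example = [
    {"joint": "right_knee", "confidence": "high", "reasoning": "effusion on exam"},
    {"joint": "left_2nd_pip", "confidence": "low", "reasoning": "'all fingers swollen'"},
    {"joint": "left_wrist", "confidence": "intermediate", "reasoning": "limited range of motion"},
]
accepted, flagged = split_for_review(example)
print([f["joint"] for f in flagged])
```

Passing `review_levels=("low", "intermediate")` would widen the manual queue; where to set that cutoff is a study-specific trade-off between reviewer time and tolerance for error.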
To cite this abstract in AMA style:
Natter M, Lee S, Chang M, Ong M. Identification of Joints with Active Arthritis from Clinical Notes with High Fidelity using a Large Language Model [abstract]. Arthritis Rheumatol. 2026; 78 (suppl 3). https://acrabstracts.org/abstract/identification-of-joints-with-active-arthritis-from-clinical-notes-with-high-fidelity-using-a-large-language-model/. Accessed .
Meeting: 2026 Pediatric Rheumatology Symposium
ACR Meeting Abstracts - https://acrabstracts.org/abstract/identification-of-joints-with-active-arthritis-from-clinical-notes-with-high-fidelity-using-a-large-language-model/
