Session Information
Session Type: ACR Poster Session C
Session Time: 9:00AM-11:00AM
Background/Purpose: Epidemiologic studies of ANCA-associated vasculitis (AAV) using large data sets are often limited by the lack of validated definitions of AAV cases that can be applied on a large scale. A prior study developed algorithms using billing codes, prescription records, and ANCA pattern (not antigen specificity) to classify patients into traditional clinical phenotypes (e.g., granulomatosis with polyangiitis, GPA) with PPV ranging from 81% to 100%. We sought to determine whether a user-friendly natural language processing (NLP) tool could improve the performance of AAV case-finding algorithms in an electronic health record (EHR) database.
Methods: Using EHR data on 2 million patients from a large, multi-center healthcare system that includes Massachusetts General Hospital (MGH) and Brigham and Women’s Hospital (BWH), we evaluated the performance of algorithms that incorporated billing codes, ANCA antigen specificity test results, and/or NLP to identify patients with AAV. Unstructured data (e.g., pathology reports, clinical notes) were searched using NLP for key words and phrases suggestive of AAV. The NLP program excludes reports in which the search phrase appears near a term that may negate a diagnosis of AAV (e.g., “the patient does not have ANCA-associated vasculitis”). To assess the performance of each algorithm, measured as positive predictive value (PPV), a cohort of patients with and without AAV was identified from a population of 35,623 patients. We then evaluated the performance of each algorithm in randomly assembled cohorts of patients evaluated in rheumatology and nephrology clinics.
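The negation-aware keyword search described in the Methods can be illustrated with a minimal Python sketch. The abstract does not name the NLP tool or publish its phrase lists, so the search terms, negation cues, and the 40-character proximity window below are all assumptions chosen for illustration, not the authors' implementation.

```python
import re

# Hypothetical search phrases; the abstract does not list the actual terms.
AAV_TERMS = [
    r"anca[- ]associated vasculitis",
    r"granulomatosis with polyangiitis",
    r"microscopic polyangiitis",
]

# Hypothetical negation cues; the real tool's list is not given in the abstract.
NEGATION_CUES = [
    r"\bno\b",
    r"\bnot\b",
    r"\bwithout\b",
    r"\bdenies\b",
    r"\bruled out\b",
    r"\bnegative for\b",
]

AAV_RE = re.compile("|".join(AAV_TERMS), re.IGNORECASE)
NEG_RE = re.compile("|".join(NEGATION_CUES), re.IGNORECASE)

# Assumed proximity window (characters on each side of the matched phrase).
WINDOW_CHARS = 40


def mentions_aav(report: str) -> bool:
    """Return True if the report has at least one non-negated AAV mention."""
    # Split into rough sentences so a negation in one sentence
    # does not cancel a mention in the next.
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        for match in AAV_RE.finditer(sentence):
            start = max(0, match.start() - WINDOW_CHARS)
            context = sentence[start:match.end() + WINDOW_CHARS]
            if not NEG_RE.search(context):
                return True
    return False
```

Under these assumptions, `mentions_aav("The patient does not have ANCA-associated vasculitis.")` is rejected because "not" falls inside the window, while a report stating "Biopsy consistent with granulomatosis with polyangiitis." is kept. A fixed character window is a simplification; established negation algorithms use more careful scoping rules.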
Results: The general AAV cohort used for primary validation was established from the entire population and included 207 patients, the majority of whom had AAV (N=161, 78%). This cohort included 25 patients (12.1%) with positive ANCA test results but without AAV. An algorithm solely using billing codes had a PPV of 79% (73%-84%), 18% (5%-40%), and 4% (0%-14%) for identifying cases of AAV in the entire EHR, a rheumatology clinic cohort, and nephrology clinic cohort, respectively (Table 1). An algorithm that required an NLP reference to AAV, a billing code associated with AAV, and a positive PR3- or MPO-ANCA test result led to a PPV of 95% (88%-98%), 100%, and 100%, respectively.
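For readers who want to reproduce this kind of estimate, PPV is the number of confirmed AAV cases divided by the number of patients the algorithm flags. The sketch below computes a PPV with a 95% confidence interval. The abstract does not state which interval method was used, so the Wilson score interval here is an assumption, and the counts in the usage note are illustrative rather than taken from the study.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def ppv_with_ci(true_positives: int, flagged: int,
                confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (ppv, lower, upper) using the Wilson score interval.

    The Wilson interval is an assumed choice; the abstract does not
    say how its confidence intervals were calculated.
    """
    if flagged <= 0:
        raise ValueError("flagged must be a positive count")
    if not 0 <= true_positives <= flagged:
        raise ValueError("true_positives must be between 0 and flagged")

    p = true_positives / flagged
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    denom = 1 + z**2 / flagged
    center = (p + z**2 / (2 * flagged)) / denom
    half_width = z * sqrt(p * (1 - p) / flagged
                          + z**2 / (4 * flagged**2)) / denom
    return p, max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, `ppv_with_ci(90, 100)` evaluates a hypothetical algorithm that flags 100 patients of whom 90 are confirmed on chart review. The Wilson interval stays within 0 and 1 even when the PPV is near 100%, which is relevant for small clinic cohorts.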
Conclusion: In our study, the use of NLP substantially improved the PPV of algorithms meant to identify cases of AAV. In the context of increasingly large data sources that include both structured (e.g., billing codes, test results) and unstructured data (e.g., clinical notes), NLP can improve the ability to accurately (PPV > 90%) classify patients with AAV. Furthermore, as classification by ANCA type is increasingly viewed as superior to classification by clinical phenotype (e.g., GPA) for differentiating AAV subtypes, an algorithm such as ours that incorporates ANCA type can be useful for future epidemiologic studies of AAV using EHRs.
Table 1: Algorithm Performance in 207 Patients Selected based on ICD-9 Codes for ANCA-Associated Vasculitis
Algorithm | Total Possible AAV Cases Identified by Algorithm in EHR | Positive Predictive Value (95% CI) |
1. ICD-9 code | 20,557 | 79% (73%-84%) |
2. ICD-9 and ANCA-positive | 1,951 | 88% (82%-92%) |
3. NLP and ANCA-positive | 898 | 92% (87%-98%) |
4. NLP or ICD-9 and ANCA-positive | 2,065 | 87% (80%-91%) |
5. NLP and ICD-9 and ANCA-positive | 775 | 95% (88%-98%) |
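The five algorithms in Table 1 are combinations of three per-patient indicators, and can be sketched as boolean rules. The `PatientFlags` type and its field names are hypothetical, and algorithm 4 is read here as "(NLP or ICD-9) and ANCA-positive", which is an assumption because the table does not state the precedence.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PatientFlags:
    """Hypothetical per-patient indicators derived from the EHR."""
    icd9: bool           # has a billing code associated with AAV
    nlp: bool            # has a non-negated NLP mention of AAV
    anca_positive: bool  # has a positive PR3- or MPO-ANCA result


# Rule names mirror the rows of Table 1.
ALGORITHMS: Dict[str, Callable[[PatientFlags], bool]] = {
    "1. ICD-9 code": lambda p: p.icd9,
    "2. ICD-9 and ANCA-positive": lambda p: p.icd9 and p.anca_positive,
    "3. NLP and ANCA-positive": lambda p: p.nlp and p.anca_positive,
    # Assumed grouping: (NLP or ICD-9) and ANCA-positive.
    "4. NLP or ICD-9 and ANCA-positive":
        lambda p: (p.nlp or p.icd9) and p.anca_positive,
    "5. NLP and ICD-9 and ANCA-positive":
        lambda p: p.nlp and p.icd9 and p.anca_positive,
}


def classify(patient: PatientFlags) -> Dict[str, bool]:
    """Apply every algorithm to one patient and report which ones flag it."""
    return {name: rule(patient) for name, rule in ALGORITHMS.items()}
```

A patient with a billing code and a positive ANCA test but no NLP mention, `PatientFlags(icd9=True, nlp=False, anca_positive=True)`, would be flagged by algorithms 1, 2, and 4 under this reading, but not by 3 or 5. This shows why the stricter rules identify fewer possible cases while reaching a higher PPV.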
To cite this abstract in AMA style:
Wallace Z, Stone JH, Choi HK. Identifying ANCA-Associated Vasculitis Cases in Electronic Health Records Using Natural Language Processing [abstract]. Arthritis Rheumatol. 2018; 70 (suppl 9). https://acrabstracts.org/abstract/identifying-anca-associated-vasculitis-cases-in-electronic-health-records-using-natural-language-processing/. Accessed .
Meeting: 2018 ACR/ARHP Annual Meeting
ACR Meeting Abstracts - https://acrabstracts.org/abstract/identifying-anca-associated-vasculitis-cases-in-electronic-health-records-using-natural-language-processing/