Natural Language Processing to Identify Lupus Nephritis Phenotype in Electronic Health Records

Yu Deng¹, Jennifer Pacheco¹, Anh Chung¹, Chengsheng Mao¹, Joshua Smith², Juan Zhao¹, Wei-Qi Wei², April Barnado³, Chunhua Weng⁴, Cong Liu⁴, Adam Gordon¹, Jingzhi Yu¹, Yacob Tedla¹, Abel Kho¹, Rosalind Ramsey-Goldman¹, Theresa Walunas¹ and Yuan Luo¹, ¹Northwestern University, Chicago, IL, ²Vanderbilt University Medical Center, Nashville, TN, ³Vanderbilt University Medical Center, Nashville, TN, ⁴Columbia University, New York, NY

Meeting: ACR Convergence 2021

Keywords: computational phenotyping, electronic health records, Lupus nephritis, natural language processing, Systemic lupus erythematosus (SLE)

Session Information

Date: Saturday, November 6, 2021

Title: SLE – Diagnosis, Manifestations, & Outcomes Poster I: Diagnosis (0323–0356)

Session Type: Poster Session A

Session Time: 8:30AM-10:30AM

Background/Purpose: Lupus nephritis (LN) is a major disease manifestation of systemic lupus erythematosus (SLE) that leads to organ damage and increased mortality. Accurately identifying LN, a key domain of the SLE classification criteria, in electronic health records (EHRs) would add value to observational studies and clinical trials. However, information related to LN, such as kidney biopsy findings, is usually recorded in clinical notes rather than as structured data. In this study, we developed algorithms to identify LN with and without natural language processing (NLP) using EHR data from the Northwestern Medicine Enterprise Data Warehouse (NMEDW). We hypothesized that NLP algorithms that include information from clinical notes would outperform a baseline algorithm that uses structured data only.

Methods: We identified 472 patients with SLE from the Chicago Lupus Database who also had at least four encounters in the NMEDW. We developed four algorithms: a rule-based baseline algorithm using only structured data and three NLP algorithms based on L2-regularized logistic regression. In the first NLP algorithm (Full-MetaMap-binary), we used the presence or absence of every MetaMap-extracted concept unique identifier (CUI) as a feature. In the second NLP algorithm (Full-MetaMap-count), we used the same CUIs as features but took each CUI's number of occurrences as the feature value. In the third NLP algorithm (MetaMap-mixed), we used a mixture of features from structured data, regular expression (regex) concepts, and a curated list of CUIs related to LN. We evaluated all four algorithms on an internal validation dataset using F-measure, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). We further validated the baseline algorithm and the best-performing NLP algorithm on an external dataset from Vanderbilt University Medical Center (VUMC).
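
The difference between the binary and count feature encodings, and the role of the L2 penalty, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the CUI strings and labels are invented toy data, the names `featurize` and `fit_l2_logreg` are hypothetical, and the plain gradient-descent fit with arbitrary `lam`, `lr`, and `epochs` stands in for whatever solver and penalty strength the study actually used, which the abstract does not report.

```python
import math
from collections import Counter

# Toy data, invented for illustration: one list of extracted CUIs per patient.
# The CUI strings are placeholders, not real identifiers from the study.
patient_cuis = [
    ["C0001", "C0002", "C0001"],
    ["C0002", "C0003"],
    ["C0001", "C0001", "C0003"],
    ["C0003"],
]
labels = [1, 0, 1, 0]  # 1 = LN, 0 = no LN (made-up labels)

# Fixed feature order: every CUI seen in the toy corpus, sorted.
vocab = sorted({cui for cuis in patient_cuis for cui in cuis})


def featurize(cuis, mode):
    """Turn one patient's CUI list into a feature vector over `vocab`."""
    counts = Counter(cuis)
    if mode == "binary":  # presence/absence, as in Full-MetaMap-binary
        return [1.0 if counts[c] > 0 else 0.0 for c in vocab]
    if mode == "count":  # number of occurrences, as in Full-MetaMap-count
        return [float(counts[c]) for c in vocab]
    raise ValueError(f"unknown mode: {mode}")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    return b + sum(wj * xj for wj, xj in zip(w, x))


def fit_l2_logreg(X, y, lam=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2.

    The bias b is left unpenalized; lam, lr, and epochs are arbitrary choices.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi  # d(log-loss)/d(score)
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(score(w, b, x)) >= 0.5 else 0


X_binary = [featurize(cuis, "binary") for cuis in patient_cuis]
X_count = [featurize(cuis, "count") for cuis in patient_cuis]
w, b = fit_l2_logreg(X_binary, labels)
print([predict(w, b, x) for x in X_binary])
```

The same `fit_l2_logreg` call accepts `X_count` unchanged, so the two encodings differ only in the feature values passed to the model. A MetaMap-mixed style model would, under the same assumptions, append structured-data and regex-derived columns to a reduced CUI vocabulary.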

Results: In the NMEDW internal validation dataset, the Full-MetaMap-binary, Full-MetaMap-count, and MetaMap-mixed models achieved F-measures of 0.72, 0.71, and 0.79, respectively, compared with 0.41 for the baseline model (see Table 1). In the external validation dataset, our best-performing NLP model (MetaMap-mixed) improved the F-measure over the structured-data-only algorithm (0.96 vs 0.62).

Table 1. Algorithm performance

Dataset	Algorithm	Sensitivity	Specificity	PPV	NPV	F Measure
NMEDW (internal validation)	Baseline	0.43	0.6	0.39	0.64	0.41
NMEDW (internal validation)	Full-MetaMap-binary	0.63	0.93	0.85	0.81	0.72
NMEDW (internal validation)	Full-MetaMap-count	0.6	0.95	0.88	0.8	0.71
NMEDW (internal validation)	MetaMap-mixed	0.74	0.92	0.84	0.86	0.79
VUMC	Baseline	0.92	0.61	0.46	0.96	0.62
VUMC	MetaMap-mixed	1	0.97	0.93	1	0.96
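
As a consistency check on Table 1, the F-measure can be recomputed from the reported PPV and sensitivity. The sketch below assumes the F-measure is the balanced F1 score, i.e. the harmonic mean of PPV (precision) and sensitivity (recall); the abstract does not state which F-measure variant was used, and the function name `f_measure` is hypothetical.

```python
def f_measure(ppv, sensitivity):
    """Balanced F1: harmonic mean of PPV (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# (sensitivity, PPV, reported F-measure), copied from Table 1.
table_rows = {
    "NMEDW Baseline": (0.43, 0.39, 0.41),
    "NMEDW Full-MetaMap-binary": (0.63, 0.85, 0.72),
    "NMEDW Full-MetaMap-count": (0.60, 0.88, 0.71),
    "NMEDW MetaMap-mixed": (0.74, 0.84, 0.79),
    "VUMC Baseline": (0.92, 0.46, 0.62),
    "VUMC MetaMap-mixed": (1.00, 0.93, 0.96),
}

recomputed = {
    name: f_measure(ppv, sens) for name, (sens, ppv, _) in table_rows.items()
}

for name, (_, _, reported) in table_rows.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported:.2f}")
```

Under this assumption the recomputed values agree with the reported F-measures to within about 0.01. The small gap for the VUMC baseline is the kind of difference expected when the inputs themselves have been rounded to two decimals.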

Conclusion: We developed three NLP models and compared them with a structured-data-only algorithm for identifying LN in EHRs. The best-performing NLP algorithm, which incorporates structured data, CUIs, and regex concepts, improved the F-measure in both the internal and external validation datasets. NLP algorithms can serve as powerful tools to accurately identify LN in EHRs for clinical research.

Disclosures: Y. Deng, None; J. Pacheco, None; A. Chung, None; C. Mao, None; J. Smith, None; J. Zhao, None; W. Wei, None; A. Barnado, None; C. Weng, None; C. Liu, None; A. Gordon, None; J. Yu, None; Y. Tedla, None; A. Kho, Datavant, 1, 7, 11; R. Ramsey-Goldman, None; T. Walunas, None; Y. Luo, None.

To cite this abstract in AMA style:

Deng Y, Pacheco J, Chung A, Mao C, Smith J, Zhao J, Wei W, Barnado A, Weng C, Liu C, Gordon A, Yu J, Tedla Y, Kho A, Ramsey-Goldman R, Walunas T, Luo Y. Natural Language Processing to Identify Lupus Nephritis Phenotype in Electronic Health Records [abstract]. Arthritis Rheumatol. 2021; 73 (suppl 9). https://acrabstracts.org/abstract/natural-language-processing-to-identify-lupus-nephritis-phenotype-in-electronic-health-records/. Accessed .

