Missing Data and Multiple Imputation in Rheumatoid Arthritis Registries Using Sequential Random Forest Method

Ahmed Al-Saber¹, Adeeba Al-Herz², Jiazhu Pan¹, Khulood Saleh³, Adel Al-Awadhi⁴, Waleed Al-Kandari³, Eman Hasan⁵, Aqeel Ghanem⁶, Mohammed Hussain⁵, Yaser Ali⁷, Ebrahim Nahar⁷, Ahmad Alenizi⁸, Sawsan Hayat⁷, Fatemah Abutiban⁹, Ali Aldei⁵, Amjad Alkadi¹⁰, Heba Alhajeri⁷, Husain Behbehani³, Naser Alhadhood³, Khaled Mokaddem⁵, Ahmed Khadrawy³, Ammad Fazal³, Agaz Zaman⁷, Ghada Mazloum⁷, Youssef Bartella⁵, Sally Hamed⁵ and Ramia Alsouk⁸, ¹Department of Mathematics and Statistics, University of Strathclyde, Glasgow, United Kingdom, ²Al-Amiri Hospital, Kuwait City, Kuwait, ³Farwania Hospital, Farwania, Kuwait, ⁴Faculty of Medicine, Kuwait University, Jabria, Kuwait, ⁵Al-Amiri Hospital, Kuwait, Kuwait, ⁶Mubarak Al-Kabeer Hospital, Kuwait City, Kuwait, ⁷Mubarak Al-Kabeer Hospital, Hawally, Kuwait, ⁸Jahra Hospital, Jahra, Kuwait, ⁹Rheumatology Unit, Department of Medicine, Jaber Alahmed Alsabah hospital, State of Kuwait, Jahra, Kuwait, ¹⁰Sabah Hospital, Kuwait, Kuwait

Meeting: ACR Convergence 2020

Keywords: Biologicals, Epidemiology, registry, rheumatoid arthritis, Statistical methods

Session Information

Date: Sunday, November 8, 2020

Title: Epidemiology & Public Health Poster III: Inflammatory Rheumatic Disease

Session Type: Poster Session C

Session Time: 9:00AM-11:00AM

Background/Purpose: Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce statistical power, and can induce bias if they are related to patients’ response to treatment. In multiple imputation (MI), covariates are included in the imputation equation to predict the values of missing data. The purpose of this study is to find the best approach to estimate and impute the missing values in Kuwait Registry for Rheumatic Diseases (KRRD) patient data.

Methods: Several methods for handling missing data were implemented: multivariate imputation by chained equations (MICE), K-Nearest Neighbors (KNN), Bayesian Principal Component Analysis (BPCA), EM with Bootstrapping (Amelia II), Sequential Random Forest (MissForest) and mean imputation. The best imputation method was chosen as the one with the lowest Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Kolmogorov–Smirnov D test statistic (KS) between the imputed data points and the original data points that had been deliberately set to missing.
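The evaluation scheme described in Methods (set known values to missing, impute them, then compare imputed and original values) can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the helper names (`mask_values`, `mean_impute`, `score_imputation`) are hypothetical, and mean imputation stands in for whichever method is being scored.

```python
import math
import random
from bisect import bisect_right


def rmse(true_vals, imputed_vals):
    """Root mean square error between held-out and imputed values."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n)


def mae(true_vals, imputed_vals):
    """Mean absolute error between held-out and imputed values."""
    return sum(abs(t - p) for t, p in zip(true_vals, imputed_vals)) / len(true_vals)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def mask_values(values, fraction, rng):
    """Set a random fraction of known values to None and return their indices."""
    n_mask = max(1, round(fraction * len(values)))
    masked_idx = sorted(rng.sample(range(len(values)), n_mask))
    masked = list(values)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def mean_impute(values):
    """Baseline imputer: fill each missing entry with the mean of observed ones."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def score_imputation(original, imputed, masked_idx):
    """Compare imputed values with the originals at the masked positions only."""
    truth = [original[i] for i in masked_idx]
    guess = [imputed[i] for i in masked_idx]
    return {"rmse": rmse(truth, guess), "mae": mae(truth, guess), "ks": ks_statistic(truth, guess)}


# Synthetic stand-in for one continuous registry variable (assumed values).
rng = random.Random(0)
esr = [rng.gauss(30.0, 12.0) for _ in range(200)]
masked, idx = mask_values(esr, 0.2, rng)
scores = score_imputation(esr, mean_impute(masked), idx)
```

Swapping `mean_impute` for another imputer and repeating the scoring at several masking fractions would give a per-method comparison of the kind the abstract reports; lower values of all three scores indicate imputed values closer to the originals.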

Results: A total of 1,685 rheumatoid arthritis (RA) patients and 10,613 hospital visits were included in the registry. Among them, we found a number of variables that had missing values exceeding 5% of the total values. These included duration of RA (13.0%), smoking history (26.3%), rheumatoid factor (7.93%), anti-citrullinated peptide antibodies (20.5%), anti-nuclear antibodies (20.4%), sicca symptoms (19.2%), family history of a rheumatic disease (28.5%), steroid therapy (5.94%), ESR (5.16%), CRP (22.9%) and SDAI (38.0%). The results showed that among the methods used, MissForest gave the highest level of accuracy in estimating the missing values. It had the lowest imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction differences when the models used imputed laboratory values. In both data sets, MICE had the second-lowest imputation error and prediction differences, followed by KNN and mean imputation.
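The 5% missingness screen mentioned in Results amounts to computing, per variable, the share of records with no value. A small sketch follows, assuming visits are stored as dictionaries with `None` for missing entries; the field names and the four toy records are invented for illustration and are not KRRD data.

```python
def missing_fraction(records, variable):
    """Share of records in which `variable` is absent or None."""
    return sum(1 for r in records if r.get(variable) is None) / len(records)


def variables_above_threshold(records, variables, threshold=0.05):
    """Return {variable: missing fraction} for variables exceeding the threshold."""
    fractions = {v: missing_fraction(records, v) for v in variables}
    return {v: f for v, f in fractions.items() if f > threshold}


# Toy records (hypothetical field names and values).
visits = [
    {"esr": 28.0, "crp": 5.1, "smoking": "never"},
    {"esr": 40.0, "crp": None, "smoking": None},
    {"esr": 35.0, "crp": 7.9, "smoking": "former"},
    {"esr": 22.0, "crp": None, "smoking": "never"},
]
flagged = variables_above_threshold(visits, ["esr", "crp", "smoking"])
```

Here `flagged` keeps `crp` and `smoking` and drops `esr`, which has no missing entries in the toy data.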

Conclusion: MissForest is a highly accurate method of imputation for missing data in KRRD and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in clinical predictive models. This approach can be used in registries, including those for rheumatoid arthritis patients, to improve the accuracy of data.

Root mean square error (RMSE) of various missing data imputation methods, showing that MissForest method has the lowest error to predict the missing information in KRRD data

Disclosure: A. Al-Saber, None; A. Al-Herz, None; J. Pan, None; K. Saleh, None; A. Al-Awadhi, None; W. Al-Kandari, None; E. Hasan, None; A. Ghanem, AbbVie, 5, 8, Algorithm, 5, 8, Bristol Myers Squibb, 5, 8, Gilead, 5, 8, GlaxoSmithKline, 5, 8, Janssen, 5, 8, Lilly, 5, 8, New Bridge, 5, 8, Novartis, 5, 8, Pfizer, 5, 8, Roche, 5, 8; M. Hussain, None; Y. Ali, None; E. Nahar, None; A. Alenizi, None; S. Hayat, None; F. Abutiban, None; A. Aldei, None; A. Alkadi, None; H. Alhajeri, None; H. Behbehani, None; N. Alhadhood, None; K. Mokaddem, None; A. Khadrawy, None; A. Fazal, None; A. Zaman, None; G. Mazloum, None; Y. Bartella, None; S. Hamed, None; R. Alsouk, None.

To cite this abstract in AMA style:

Al-Saber A, Al-Herz A, Pan J, Saleh K, Al-Awadhi A, Al-Kandari W, Hasan E, Ghanem A, Hussain M, Ali Y, Nahar E, Alenizi A, Hayat S, Abutiban F, Aldei A, Alkadi A, Alhajeri H, Behbehani H, Alhadhood N, Mokaddem K, Khadrawy A, Fazal A, Zaman A, Mazloum G, Bartella Y, Hamed S, Alsouk R. Missing Data and Multiple Imputation in Rheumatoid Arthritis Registries Using Sequential Random Forest Method [abstract]. Arthritis Rheumatol. 2020; 72 (suppl 10). https://acrabstracts.org/abstract/missing-data-and-multiple-imputation-in-rheumatoid-arthritis-registries-using-sequential-random-forest-method/. Accessed .


ACR Meeting Abstracts - https://acrabstracts.org/abstract/missing-data-and-multiple-imputation-in-rheumatoid-arthritis-registries-using-sequential-random-forest-method/