Comparison of imputation methods for missing laboratory data in medicine

Akbar K. Waljee; Ashin Mukherjee; Amit G. Singal; Yiwei Zhang; Jeffrey Warren; Ulysses Balis; Jorge Marrero; Ji Zhu; Peter D R Higgins

doi:10.1136/bmjopen-2013-002847

Research output: Contribution to journal › Article › peer-review

286 Scopus citations

Abstract

Objectives: Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. Design: Retrospective cohort analysis of two large data sets. Setting: A tertiary level care institution in Ann Arbor, Michigan. Participants: The Cirrhosis cohort had 446 patients and the Inflammatory Bowel Disease cohort had 395 patients. Methods: Non-missing laboratory data were randomly removed with varying frequencies from two large data sets, and we then compared the ability of four methods to impute the simulated missing data: missForest, mean imputation, nearest neighbour imputation and multivariate imputation by chained equations (MICE). We characterised the accuracy of the imputation and the effect of the imputation on predictive ability in two large data sets. Results: MissForest had the least imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction difference when models used imputed laboratory values. In both data sets, MICE had the second least imputation error and prediction difference, followed by nearest neighbour imputation and mean imputation. Conclusions: MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.
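The evaluation design described in the abstract (remove known values completely at random at several frequencies, impute them, then score the imputed values against the removed truth) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' code or data: the synthetic three-column data set, the helper names (make_synthetic_labs, remove_mcar, impute_mean, impute_knn, rmse_on_removed), the choice of k, and the use of plain RMSE as the error measure. Only mean imputation and a simple nearest neighbour imputation are shown; missForest and MICE are not implemented here.

```python
import math
import random


def make_synthetic_labs(n_rows, rng):
    """Hypothetical stand-in for laboratory data: three correlated numeric columns."""
    rows = []
    for _ in range(n_rows):
        base = rng.gauss(0.0, 1.0)
        rows.append([base + rng.gauss(0.0, 0.3),
                     2.0 * base + rng.gauss(0.0, 0.3),
                     -base + rng.gauss(0.0, 0.3)])
    return rows


def remove_mcar(rows, frac, rng):
    """Blank each cell independently with probability frac (missing completely at random)."""
    return [[None if rng.random() < frac else v for v in row] for row in rows]


def impute_mean(masked):
    """Replace each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def impute_knn(masked, k=5):
    """Fill each missing cell with the average of that column over the k nearest rows.

    Distance is the mean squared difference over columns observed in both rows.
    Cells with no usable neighbour keep the column-mean fallback.
    """
    out = impute_mean(masked)
    for i, row in enumerate(masked):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for other in masked:
                if other[j] is None:
                    continue  # this row cannot donate a value for column j
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                nearest = candidates[:k]
                out[i][j] = sum(val for _, val in nearest) / len(nearest)
    return out


def rmse_on_removed(truth, masked, imputed):
    """Root mean squared error, computed only over the cells that were removed."""
    sq = [(imputed[i][j] - truth[i][j]) ** 2
          for i, row in enumerate(masked)
          for j, v in enumerate(row) if v is None]
    return math.sqrt(sum(sq) / len(sq)) if sq else 0.0


rng = random.Random(0)
truth = make_synthetic_labs(200, rng)
for frac in (0.1, 0.2, 0.3):  # illustrative missingness frequencies
    masked = remove_mcar(truth, frac, rng)
    print(frac,
          round(rmse_on_removed(truth, masked, impute_mean(masked)), 3),
          round(rmse_on_removed(truth, masked, impute_knn(masked)), 3))
```

Scoring only the removed cells is the point of the design: because the true values are known before removal, imputation error can be measured directly at each missingness frequency.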

Original language	English (US)
Article number	e002847
Journal	BMJ Open
Volume	3
Issue number	8
DOIs	https://doi.org/10.1136/bmjopen-2013-002847
State	Published - 2013

ASJC Scopus subject areas

General Medicine

Cite this

@article{6faaaa941033477ab37247842ad5cfd9,

title = "Comparison of imputation methods for missing laboratory data in medicine",

abstract = "Objectives: Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. Design: Retrospective cohort analysis of two large data sets. Setting: A tertiary level care institution in Ann Arbor, Michigan. Participants: The Cirrhosis cohort had 446 patients and the Inflammatory Bowel Disease cohort had 395 patients. Methods: Non-missing laboratory data were randomly removed with varying frequencies from two large data sets, and we then compared the ability of four methods to impute the simulated missing data: missForest, mean imputation, nearest neighbour imputation and multivariate imputation by chained equations (MICE). We characterised the accuracy of the imputation and the effect of the imputation on predictive ability in two large data sets. Results: MissForest had the least imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction difference when models used imputed laboratory values. In both data sets, MICE had the second least imputation error and prediction difference, followed by nearest neighbour imputation and mean imputation. Conclusions: MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.",

author = "Waljee, {Akbar K.} and Ashin Mukherjee and Singal, {Amit G.} and Yiwei Zhang and Jeffrey Warren and Ulysses Balis and Jorge Marrero and Ji Zhu and Higgins, {Peter D R}",

year = "2013",

doi = "10.1136/bmjopen-2013-002847",

language = "English (US)",

volume = "3",

journal = "BMJ Open",

issn = "2044-6055",

publisher = "BMJ Publishing Group",

number = "8",

}

TY - JOUR

T1 - Comparison of imputation methods for missing laboratory data in medicine

AU - Waljee, Akbar K.

AU - Mukherjee, Ashin

AU - Singal, Amit G.

AU - Zhang, Yiwei

AU - Warren, Jeffrey

AU - Balis, Ulysses

AU - Marrero, Jorge

AU - Zhu, Ji

AU - Higgins, Peter D R

PY - 2013

Y1 - 2013

N2 - Objectives: Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. Design: Retrospective cohort analysis of two large data sets. Setting: A tertiary level care institution in Ann Arbor, Michigan. Participants: The Cirrhosis cohort had 446 patients and the Inflammatory Bowel Disease cohort had 395 patients. Methods: Non-missing laboratory data were randomly removed with varying frequencies from two large data sets, and we then compared the ability of four methods to impute the simulated missing data: missForest, mean imputation, nearest neighbour imputation and multivariate imputation by chained equations (MICE). We characterised the accuracy of the imputation and the effect of the imputation on predictive ability in two large data sets. Results: MissForest had the least imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction difference when models used imputed laboratory values. In both data sets, MICE had the second least imputation error and prediction difference, followed by nearest neighbour imputation and mean imputation. Conclusions: MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.

AB - Objectives: Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. Design: Retrospective cohort analysis of two large data sets. Setting: A tertiary level care institution in Ann Arbor, Michigan. Participants: The Cirrhosis cohort had 446 patients and the Inflammatory Bowel Disease cohort had 395 patients. Methods: Non-missing laboratory data were randomly removed with varying frequencies from two large data sets, and we then compared the ability of four methods to impute the simulated missing data: missForest, mean imputation, nearest neighbour imputation and multivariate imputation by chained equations (MICE). We characterised the accuracy of the imputation and the effect of the imputation on predictive ability in two large data sets. Results: MissForest had the least imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction difference when models used imputed laboratory values. In both data sets, MICE had the second least imputation error and prediction difference, followed by nearest neighbour imputation and mean imputation. Conclusions: MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.

UR - http://www.scopus.com/inward/record.url?scp=84884547259&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84884547259&partnerID=8YFLogxK

U2 - 10.1136/bmjopen-2013-002847

DO - 10.1136/bmjopen-2013-002847

M3 - Article

C2 - 23906948

AN - SCOPUS:84884547259

SN - 2044-6055

VL - 3

JO - BMJ Open

JF - BMJ Open

IS - 8

M1 - e002847

ER -
