TY - JOUR
T1 - Deep Learning–Based COVID-19 Pneumonia Classification Using Chest CT Images
T2 - Model Generalizability
AU - Nguyen, Dan
AU - Kay, Fernando
AU - Tan, Jun
AU - Yan, Yulong
AU - Ng, Yee Seng
AU - Iyengar, Puneeth
AU - Peshock, Ron
AU - Jiang, Steve
N1 - Funding Information:
We would like to thank Jonathan Feinberg for editing the manuscript.
Publisher Copyright:
Copyright © 2021 Nguyen, Kay, Tan, Yan, Ng, Iyengar, Peshock and Jiang.
PY - 2021/6/29
Y1 - 2021/6/29
AB - Since the outbreak of the COVID-19 pandemic, worldwide research efforts have focused on applying artificial intelligence (AI) technologies to various medical data of COVID-19–positive patients to identify or classify aspects of the disease, with promising reported results. However, concerns have been raised over the generalizability of these models, given the heterogeneous factors in training datasets. This study aims to examine the severity of this problem by evaluating deep learning (DL) classification models trained to identify COVID-19–positive patients on 3D computed tomography (CT) datasets from different countries. We collected one dataset at UT Southwestern (UTSW) and three external datasets from different countries: CC-CCII Dataset (China), COVID-CTset (Iran), and MosMedData (Russia). We divided the data into two classes: COVID-19–positive and COVID-19–negative patients. We trained nine identical DL-based classification models by using combinations of datasets with a 72% train, 8% validation, and 20% test data split. The models trained on a single dataset achieved accuracy/area under the receiver operating characteristic curve (AUC) values of 0.87/0.826 (UTSW), 0.97/0.988 (CC-CCII), and 0.86/0.873 (COVID-CTset) when evaluated on their own dataset. The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better. However, the performance dropped close to an AUC of 0.5 (random guess) for all models when evaluated on a dataset outside of their training datasets. Including MosMedData, which contained only positive labels, in the training datasets did not necessarily improve performance on the other datasets. Multiple factors likely contributed to these results, such as patient demographics and differences in image acquisition or reconstruction, causing a data shift among different study cohorts.
KW - COVID-19
KW - SARS-CoV-2
KW - classification
KW - computed tomography
KW - convolutional neural network
KW - deep learning
KW - generalizability
UR - http://www.scopus.com/inward/record.url?scp=85117060707&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117060707&partnerID=8YFLogxK
U2 - 10.3389/frai.2021.694875
DO - 10.3389/frai.2021.694875
M3 - Article
C2 - 34268489
AN - SCOPUS:85117060707
SN - 2624-8212
VL - 4
JO - Frontiers in Artificial Intelligence
JF - Frontiers in Artificial Intelligence
M1 - 694875
ER -