TY - JOUR
T1 - Development of a machine learning framework for radiation biomarker discovery and absorbed dose prediction
AU - Andersson, Björn
AU - Langen, Britta
AU - Liu, Peidi
AU - Dávila López, Marcela
N1 - Publisher Copyright:
Copyright © 2023 Andersson, Langen, Liu and Dávila López.
PY - 2023
Y1 - 2023
N2 - Background: Molecular radiation biomarkers are an emerging tool in radiation research with applications for cancer radiotherapy, radiation risk assessment, and even human space travel. However, biomarker screening in genome-wide expression datasets using conventional tools is time-consuming and underlies analyst (human) bias. Machine Learning (ML) methods can improve the sensitivity and specificity of biomarker identification, increase analytical speed, and avoid multicollinearity and human bias. Aim: To develop a resource-efficient ML framework for radiation biomarker discovery using gene expression data from irradiated normal tissues. Further, to identify biomarker panels predicting radiation dose with tissue specificity. Methods: A strategic search in the Gene Expression Omnibus database identified a transcriptomic dataset (GSE44762) for normal tissues radiation responses (murine kidney cortex and medulla) suited for biomarker discovery using an ML approach. The dataset was pre-processed in R and separated into train and test data subsets. High computational cost of Genetic Algorithm/k-Nearest Neighbor (GA/KNN) mandated optimization and 13 ML models were tested using the caret package in R. Biomarker performance was evaluated and visualized via Principal Component Analysis (PCA) and dose regression. The novelty of ML-identified biomarker panels was evaluated by literature search. Results: Caret-based feature selection and ML methods vastly improved processing time over the GA approach. The KNN method yielded overall best performance values on train and test data and was implemented into the framework. The top-ranking genes were Cdkn1a, Gria3, Mdm2 and Plk2 in cortex, and Brf2, Ccng1, Cdkn1a, Ddit4l, and Gria3 in medulla. These candidates successfully categorized dose groups and tissues in PCA. Regression analysis showed that correlation between predicted and true dose was high with R2 of 0.97 and 0.99 for cortex and medulla, respectively. Conclusion: The caret framework is a powerful tool for radiation biomarker discovery optimizing performance with resource-efficiency for broad implementation in the field. The KNN-based approach identified Brf2, Ddit4l, and Gria3 mRNA as novel candidates that have been uncharacterized as radiation biomarkers to date. The biomarker panel showed good performance in dose and tissue separation and dose regression. Further training with larger cohorts is warranted to improve accuracy, especially for lower doses.
AB - Background: Molecular radiation biomarkers are an emerging tool in radiation research with applications for cancer radiotherapy, radiation risk assessment, and even human space travel. However, biomarker screening in genome-wide expression datasets using conventional tools is time-consuming and underlies analyst (human) bias. Machine Learning (ML) methods can improve the sensitivity and specificity of biomarker identification, increase analytical speed, and avoid multicollinearity and human bias. Aim: To develop a resource-efficient ML framework for radiation biomarker discovery using gene expression data from irradiated normal tissues. Further, to identify biomarker panels predicting radiation dose with tissue specificity. Methods: A strategic search in the Gene Expression Omnibus database identified a transcriptomic dataset (GSE44762) for normal tissues radiation responses (murine kidney cortex and medulla) suited for biomarker discovery using an ML approach. The dataset was pre-processed in R and separated into train and test data subsets. High computational cost of Genetic Algorithm/k-Nearest Neighbor (GA/KNN) mandated optimization and 13 ML models were tested using the caret package in R. Biomarker performance was evaluated and visualized via Principal Component Analysis (PCA) and dose regression. The novelty of ML-identified biomarker panels was evaluated by literature search. Results: Caret-based feature selection and ML methods vastly improved processing time over the GA approach. The KNN method yielded overall best performance values on train and test data and was implemented into the framework. The top-ranking genes were Cdkn1a, Gria3, Mdm2 and Plk2 in cortex, and Brf2, Ccng1, Cdkn1a, Ddit4l, and Gria3 in medulla. These candidates successfully categorized dose groups and tissues in PCA. Regression analysis showed that correlation between predicted and true dose was high with R2 of 0.97 and 0.99 for cortex and medulla, respectively. Conclusion: The caret framework is a powerful tool for radiation biomarker discovery optimizing performance with resource-efficiency for broad implementation in the field. The KNN-based approach identified Brf2, Ddit4l, and Gria3 mRNA as novel candidates that have been uncharacterized as radiation biomarkers to date. The biomarker panel showed good performance in dose and tissue separation and dose regression. Further training with larger cohorts is warranted to improve accuracy, especially for lower doses.
KW - absorbed dose
KW - biomarkers
KW - caret
KW - ionizing radiation
KW - kNN (k nearest neighbor)
KW - radionuclides
KW - transcriptomics
UR - http://www.scopus.com/inward/record.url?scp=85161077698&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85161077698&partnerID=8YFLogxK
U2 - 10.3389/fonc.2023.1156009
DO - 10.3389/fonc.2023.1156009
M3 - Article
C2 - 37256187
AN - SCOPUS:85161077698
SN - 2234-943X
VL - 13
JO - Frontiers in Oncology
JF - Frontiers in Oncology
M1 - 1156009
ER -