TY - JOUR
T1 - Representative splitting cross validation
AU - Xu, Lu
AU - Hu, Ou
AU - Guo, Yuwan
AU - Zhang, Mengqin
AU - Lu, Daowang
AU - Cai, Chen Bo
AU - Xie, Shunping
AU - Goodarzi, Mohammad
AU - Fu, Hai Yan
AU - She, Yuan Bin
N1 - Funding Information:
The authors are grateful for the financial support from the National Natural Science Foundation of China (Grant Nos. 21665022, 21665002, 21776321, 21576297, 21706233, 21476270), Key Projects of Technological Innovation of Hubei Province (2016ACA138), the Open Research Program (Nos. 2015ZD001, 2015ZD002 and 2015ZY006) from the Modernization Engineering Technology Research Center of Ethnic Minority Medicine of Hubei Province (South-Central University for Nationalities), and the Talented Youth Cultivation Program from “the Fundamental Research Funds for the Central Universities”, South-Central University for Nationalities (No. CRZ18002). Lu Xu is financially supported by Provincial Key Disciplines of Chemical Engineering and Technology in Guizhou Province (No. ZDXK[2017]8), Guizhou Engineering Research Center (QJHKYZ [2017]024), Guizhou Provincial Science and Technology Department (No. QKHJC[2017]1186), and the Talented Researcher Program from Guizhou Provincial Department of Education (QJHKYZ[2018]073).
Publisher Copyright:
© 2018 Elsevier B.V.
PY - 2018/12/15
Y1 - 2018/12/15
N2 - Cross-validation (CV) is widely used to estimate model complexity, or the number of significant latent variables (LVs), for multivariate calibration methods such as partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed as uniformly as possible in the experimental space. Motivated by this idea, we proposed a new CV method called representative splitting cross-validation (RSCV). In RSCV, the DUPLEX algorithm was first used to sequentially divide the original training set into k equal parts (in this work, k = 2, 4, 8 and 16). Second, a series of k-fold CVs (k = 2, 4, 8 and 16) was performed based on this data splitting. Finally, the pooled root mean squared error of CV (RMSECV) was used to estimate model complexity. Five real multivariate calibration data sets were investigated, and RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV). With a maximum k of 16, RSCV was shown to be a useful and stable method for selecting the number of PLS LVs, and it can obtain simpler models with an acceptable computational burden.
AB - Cross-validation (CV) is widely used to estimate model complexity, or the number of significant latent variables (LVs), for multivariate calibration methods such as partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed as uniformly as possible in the experimental space. Motivated by this idea, we proposed a new CV method called representative splitting cross-validation (RSCV). In RSCV, the DUPLEX algorithm was first used to sequentially divide the original training set into k equal parts (in this work, k = 2, 4, 8 and 16). Second, a series of k-fold CVs (k = 2, 4, 8 and 16) was performed based on this data splitting. Finally, the pooled root mean squared error of CV (RMSECV) was used to estimate model complexity. Five real multivariate calibration data sets were investigated, and RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV). With a maximum k of 16, RSCV was shown to be a useful and stable method for selecting the number of PLS LVs, and it can obtain simpler models with an acceptable computational burden.
KW - Cross-validation (CV)
KW - Model complexity
KW - Multivariate calibration
KW - Partial least squares (PLS)
KW - Representative splitting cross-validation (RSCV)
UR - http://www.scopus.com/inward/record.url?scp=85056180205&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85056180205&partnerID=8YFLogxK
U2 - 10.1016/j.chemolab.2018.10.008
DO - 10.1016/j.chemolab.2018.10.008
M3 - Article
AN - SCOPUS:85056180205
SN - 0169-7439
VL - 183
SP - 29
EP - 35
JO - Chemometrics and Intelligent Laboratory Systems
JF - Chemometrics and Intelligent Laboratory Systems
ER -