Collinear datasets augmentation using Procrustes validation sets

Sergey Kucheryavskiy*, Sergei Zhilin

*Corresponding author

Publication: Contribution to book/anthology/report/conference proceeding; Conference abstract in proceeding; Research

Abstract

Procrustes Cross-Validation (PCV) is a recently proposed method for validating a wide range of chemometric models, including PCA/SIMCA, PCR and PLS [1, 2]. PCV uses conventional cross-validation (CCV) to estimate the sampling error and then adds this error to the calibration set, which produces a new dataset: the Procrustes validation set (PV-set). The PV-set can then be used to validate global models in the same way as an independent validation set. PCV also provides diagnostic tools for assessing dataset quality and optimizing the splitting strategy.

In this presentation, we show a new application of PCV: data augmentation [3]. Data augmentation artificially enlarges the calibration set by generating new points from the existing data, which is exactly what PCV does. With random splits, it is possible to create a very large number of unique PV-sets. Merged with the original dataset, they can significantly improve the performance of complex machine learning models with many hyperparameters, such as artificial neural networks (ANN).
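To make the idea concrete, here is a minimal sketch of augmentation by repeated random-split PV-set generation. It is a simplified PCA-based illustration and not the published PCV algorithm: the function names `pv_set_sketch` and `augment`, the way residuals are carried over, and the synthetic data are all assumptions made for this example. It works on the predictor matrix only.

```python
import numpy as np


def pv_set_sketch(X, ncomp, nseg=4, rng=None):
    """Simplified PV-set generator (illustrative; NOT the published PCV algorithm)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    m = X.mean(axis=0)
    # Global PCA loadings (variables x ncomp) from the SVD of the centered data.
    P = np.linalg.svd(X - m, full_matrices=False)[2][:ncomp].T
    # Random split of the rows into nseg cross-validation segments.
    segments = np.array_split(rng.permutation(n), nseg)
    X_pv = np.empty_like(X)
    for held_out in segments:
        train = np.setdiff1d(np.arange(n), held_out)
        m_k = X[train].mean(axis=0)
        # Local PCA model built without the held-out rows.
        P_k = np.linalg.svd(X[train] - m_k, full_matrices=False)[2][:ncomp].T
        # Orthogonal Procrustes rotation R (ncomp x ncomp) minimizing ||P_k R - P||.
        U, _, Vt = np.linalg.svd(P_k.T @ P)
        R = U @ Vt
        Xc = X[held_out] - m_k
        T_k = Xc @ P_k            # local scores of the held-out rows
        E_k = Xc - T_k @ P_k.T    # local residuals
        # Assumption of this sketch: express the local scores in the global
        # loading space and add the local residuals back unchanged.
        X_pv[held_out] = m + T_k @ R @ P.T + E_k
    return X_pv


def augment(X, ncomp, n_pv_sets, nseg=4, seed=0):
    """Stack the original rows with n_pv_sets PV-sets from different random splits."""
    rng = np.random.default_rng(seed)
    pv_sets = [pv_set_sketch(X, ncomp, nseg, rng) for _ in range(n_pv_sets)]
    return np.vstack([X, *pv_sets])


# Synthetic collinear data: 30 samples, 50 variables driven by 3 latent factors.
demo_rng = np.random.default_rng(1)
X = demo_rng.normal(size=(30, 3)) @ demo_rng.normal(size=(3, 50))
X = X + 0.01 * demo_rng.normal(size=(30, 50))

X_aug = augment(X, ncomp=3, n_pv_sets=10)
print(X_aug.shape)  # (330, 50): 30 original rows plus 10 PV-sets of 30 rows each
```

Each PV-set has the same number of rows as the calibration set, so merging m PV-sets with the original data multiplies the number of rows by m + 1. Every call draws a new random split, which is what makes the generated sets differ from one another.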

One advantage of PCV over other data augmentation methods is that the PV-set has a variance-covariance structure similar to that of the calibration set, as if both were drawn from the same population. This makes it particularly efficient for augmenting datasets with a high degree of collinearity, such as spectral data. Preliminary tests have shown that PCV-based augmentation can reduce the root mean squared error of prediction of ANN regression models (computed on an independent test set) several-fold.

References
[1] Kucheryavskiy S, Zhilin S, Rodionova O, Pomerantsev A. Anal. Chem. 92 (2020) 11842–11850
[2] Kucheryavskiy S, Rodionova O, Pomerantsev A. Anal. Chim. Acta. 1255 (2023)
[3] Kucheryavskiy S, Zhilin S. arXiv preprint, DOI: 10.48550/arXiv.2312.04911

Original language: English
Title: XIX CAC 2024 Chemometrics in Analytical Chemistry: Book of Abstracts
Number of pages: 1
Publication date: 2024
ISBN (Electronic): 9789876924085
Status: Published - 2024
Event: Chemometrics in Analytical Chemistry - UNIVERSIDAD NACIONAL DEL LITORAL, Santa Fe, Argentina
Duration: 9 Sep 2024 to 12 Sep 2024
Conference number: XIX
https://www.fbcb.unl.edu.ar/cac2024/

Conference

Conference: Chemometrics in Analytical Chemistry
Number: XIX
Location: UNIVERSIDAD NACIONAL DEL LITORAL
Country/Territory: Argentina
City: Santa Fe
Period: 09/09/2024 to 12/09/2024