Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning

Tobias Greisager Rehfeldt; Konrad Krawczyk; Simon Gregersen Echers; Paolo Marcatili; Pawel Palczynski; Richard Röttger; Veit Schwämmle

doi:10.1101/2023.05.01.538996

Corresponding author: Veit Schwämmle

Publication: Working paper/Preprint › Preprint

Abstract

Background Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs.

Results We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning analysis to evaluate the benefits of current best-practice methods in the field.

Conclusions Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets that closely resemble future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not improve performance relative to a non-pre-trained model.

Original language	English
Publisher	bioRxiv
Number of pages	29
DOI	https://doi.org/10.1101/2023.05.01.538996
Status	Published - 2 May 2023

Access to document

10.1101/2023.05.01.538996 (License: CC BY 4.0)

Q-BIOPEP: QUANTIFICATION OF BIOACTIVE FOOD PEPTIDES FROM POTATO PROTEIN
Gregersen, S., Wimmer, R. & Abdul Khalek Gharzeddine, N.
Karl Pedersen og Hustrus Industrifond
15/06/2021 → 14/06/2024
Projects: Project › Research

Citation formats

@techreport{6a9d28a0579d4b78af17a45fee64ae3c,
title = "Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning",
abstract = "Background Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. Results We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. Conclusions Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it{\textquoteright}s important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pre-trained model.",
author = "Rehfeldt, {Tobias Greisager} and Konrad Krawczyk and Echers, {Simon Gregersen} and Paolo Marcatili and Pawel Palczynski and Richard R{\"o}ttger and Veit Schw{\"a}mmle",
year = "2023",
month = may,
day = "2",
doi = "10.1101/2023.05.01.538996",
language = "English",
publisher = "bioRxiv",
type = "WorkingPaper",
institution = "bioRxiv",
}

TY  - UNPB
T1  - Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning
AU  - Rehfeldt, Tobias Greisager
AU  - Krawczyk, Konrad
AU  - Echers, Simon Gregersen
AU  - Marcatili, Paolo
AU  - Palczynski, Pawel
AU  - Röttger, Richard
AU  - Schwämmle, Veit
PY  - 2023/5/2
Y1  - 2023/5/2
N2  - Background Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. Results We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. Conclusions Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it’s important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pre-trained model.
AB  - Background Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. Results We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. Conclusions Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it’s important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pre-trained model.
U2  - 10.1101/2023.05.01.538996
DO  - 10.1101/2023.05.01.538996
M3  - Preprint
BT  - Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning
PB  - bioRxiv
ER  - 
