BarcodeBERT: Transformers for Biodiversity Analysis

Pablo Millan Arias; Niousha Sadjadi; Monireh Safari; ZeMing Gong; Austin T. Wang; Scott C. Lowe; Joakim Bruslund Haurum; Iuliia Zarubiieva; Dirk Steinke; Lila Kari; Angel X. Chang; Graham W. Taylor

Publication: Contribution to conference without publisher/journal › Paper without publisher/journal › Research › peer-review

Abstract

Understanding biodiversity is a global challenge, in which DNA barcodes (short snippets of DNA that cluster by species) play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT

Original language	English
Publication date	2023
Number of pages	9
Status	Published - 2023
Event	4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023) - New Orleans, USA. Duration: 16 Dec 2023 → 16 Dec 2023. https://sslneurips23.github.io/

Workshop

Workshop	4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023)
Country/Territory	USA
City	New Orleans
Period	16/12/2023 → 16/12/2023
Internet address	https://sslneurips23.github.io/

Access to document

https://arxiv.org/abs/2311.02401 (License: CC BY 4.0)

Citation formats

Arias, P. M., Sadjadi, N., Safari, M., Gong, Z., Wang, A. T., Lowe, S. C., Haurum, J. B., Zarubiieva, I., Steinke, D., Kari, L., Chang, A. X., & Taylor, G. W. (2023). BarcodeBERT: Transformers for Biodiversity Analysis. Paper presented at 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023), New Orleans, Louisiana, USA. https://arxiv.org/abs/2311.02401

@conference{39be96fb19934b81aabb7a43f42c3d02,

title = "BarcodeBERT: Transformers for Biodiversity Analysis",

abstract = "Understanding biodiversity is a global challenge, in which DNA barcodes (short snippets of DNA that cluster by species) play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT",

keywords = "Biodiversity, DNA Barcode, Transformers, Machine Learning, Deep Learning, Self-supervised, Zero-shot learning, Bayesian",

author = "Arias, {Pablo Millan} and Niousha Sadjadi and Monireh Safari and ZeMing Gong and Wang, {Austin T.} and Lowe, {Scott C.} and Haurum, {Joakim Bruslund} and Iuliia Zarubiieva and Dirk Steinke and Lila Kari and Chang, {Angel X.} and Taylor, {Graham W.}",

year = "2023",

language = "English",

note = "4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023), SSL NeurIPS ; Conference date: 16-12-2023 Through 16-12-2023",

url = "https://sslneurips23.github.io/",

}

BarcodeBERT: Transformers for Biodiversity Analysis. / Arias, Pablo Millan; Sadjadi, Niousha; Safari, Monireh et al.
2023. Paper presented at 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023), New Orleans, Louisiana, USA.

TY - CONF

T1 - BarcodeBERT: Transformers for Biodiversity Analysis

AU - Arias, Pablo Millan

AU - Sadjadi, Niousha

AU - Safari, Monireh

AU - Gong, ZeMing

AU - Wang, Austin T.

AU - Lowe, Scott C.

AU - Haurum, Joakim Bruslund

AU - Zarubiieva, Iuliia

AU - Steinke, Dirk

AU - Kari, Lila

AU - Chang, Angel X.

AU - Taylor, Graham W.

PY - 2023

Y1 - 2023

N2 - Understanding biodiversity is a global challenge, in which DNA barcodes (short snippets of DNA that cluster by species) play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT

KW - Biodiversity

KW - DNA Barcode

KW - Transformers

KW - Machine Learning

KW - Deep Learning

KW - Self-supervised

KW - Zero-shot learning

KW - Bayesian

M3 - Paper without publisher/journal

T2 - 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023)

Y2 - 16 December 2023 through 16 December 2023

ER -
