BarcodeBERT: Transformers for Biodiversity Analysis

Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Scott C. Lowe, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, Angel X. Chang, Graham W. Taylor

Publication: Conference contribution without publisher/journal; Paper without publisher/journal; Research; Peer-reviewed

Abstract

Understanding biodiversity is a global challenge, in which DNA barcodes, short snippets of DNA that cluster by species, play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT

Original language: English
Publication date: 2023
Number of pages: 9
Status: Published - 2023
Event: 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023) - New Orleans, USA
Duration: 16 Dec 2023
https://sslneurips23.github.io/

