Abstract
Describing images using structured data enables a wide range of automation tasks, such as search and organization, as well as downstream tasks, such as labeling images or training machine learning models. However, there is currently a lack of structured data labels for large image repositories such as Wikimedia Commons. To close this gap, we propose the task of Visual Entity Linking (VEL) for Wikimedia Commons, which involves predicting labels for Wikimedia Commons images based on Wikidata items as the label inventory. We create a novel dataset leveraging community-created structured data on Wikimedia Commons. Additionally, we fine-tune pre-trained models based on the CLIP architecture using this dataset. Although the best-performing models show promising results, the study also identifies key challenges of the data and the task.
Original language | English |
---|---|
Title | ALVR 2024 - 3rd Workshop on Advances in Language and Vision Research, Proceedings of the Workshop |
Editors | Jing Gu, Tsu-Jui Fu, Drew Hudson, Asli Celikyilmaz, William Wang |
Number of pages | 9 |
Publisher | Association for Computational Linguistics |
Publication date | 2024 |
Pages | 186-194 |
ISBN (Electronic) | 9798891761537 |
DOI | |
Status | Published - 2024 |
Event | 3rd Workshop on Advances in Language and Vision Research, ALVR 2024 - Bangkok, Thailand. Duration: 16 Aug 2024 → … |
Conference
Conference | 3rd Workshop on Advances in Language and Vision Research, ALVR 2024 |
---|---|
Country/Territory | Thailand |
City | Bangkok |
Period | 16/08/2024 → … |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.