Abstract
Text-image retrieval (T2I) is the task of retrieving all images relevant to a textual query. Popular T2I datasets, such as Flickr30k, VG, and MS-COCO, use annotated image captions, e.g., “a man playing with a kid”, as surrogates for queries. On such surrogate queries, current multi-modal machine learning models, such as CLIP or BLIP, perform remarkably well, largely because captions are descriptive: they detail the content of an image. Yet T2I queries go beyond the mere descriptions in image-caption pairs, so these datasets are ill-suited for testing methods on more abstract or conceptual queries, e.g., “family vacations”, where the image content is implied rather than explicitly described. In this paper, we replicate T2I results on descriptive queries and generalize them to conceptual queries. To this end, we run new experiments on ConQA, a novel T2I benchmark for the task of conceptual query answering. ConQA comprises 30 descriptive and 50 conceptual queries over 43k images, with more than 100 manually annotated images per query. On established measures, both large pretrained models (e.g., CLIP, BLIP, and BLIP2) and smaller models (e.g., SGRAF and NAAF) perform up to 4× better on descriptive than on conceptual queries. We also find that the models perform better on queries with more than 6 keywords, as in MS-COCO captions.
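The retrieval setup the abstract evaluates can be illustrated with a minimal sketch of the ranking step shared by joint-embedding models such as CLIP: embed the query and each candidate image into a common vector space, then rank images by cosine similarity to the query. The embeddings below are hypothetical toy vectors, not outputs of any of the evaluated models, and the helper name `rank_images` is an illustration, not an API from the paper.

```python
import numpy as np

def rank_images(query_emb, image_embs):
    """Rank image indices by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q  # cosine similarity of each image to the query
    return np.argsort(-scores), scores

# Toy joint-space embeddings (hypothetical; real encoders like CLIP emit
# 512- or 768-dimensional vectors for both text and images).
query = np.array([1.0, 0.0, 0.5])
images = np.array([
    [0.9, 0.1, 0.4],  # image closely matching the query description
    [0.0, 1.0, 0.0],  # unrelated image
    [0.5, 0.2, 0.9],  # partially related image
])

order, scores = rank_images(query, images)
print(order)  # most to least similar image index
```

Descriptive queries tend to land close to caption-like image embeddings in this space, whereas conceptual queries (e.g., “family vacations”) may be far from any single caption embedding, which is the gap ConQA is designed to measure.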
Original language | English
---|---
Title of host publication | Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Proceedings
Editors | Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, Iadh Ounis
Number of pages | 16
Publication date | 2024
Pages | 161-176
ISBN (Print) | 9783031560651
DOIs |
Publication status | Published - 2024