Image-Text Connection: Exploring the Expansion of the Diversity Within Joint Feature Space Similarity Score

Mohammadi, Mahsa; Eftekhari, Mahdi; Hassani, Amirhossein

dc.contributor.author	Mohammadi, Mahsa
dc.contributor.author	Eftekhari, Mahdi
dc.contributor.author	Hassani, Amirhossein
dc.date.accessioned	2023-11-21T14:55:25Z
dc.date.available	2023-11-21T14:55:25Z
dc.date.created	2023-11-10T12:19:42Z
dc.date.issued	2023
dc.identifier.citation	IEEE Access. 2023, 11, 123209-123222.	en_US
dc.identifier.issn	2169-3536
dc.identifier.uri	https://hdl.handle.net/11250/3103915
dc.description.abstract	Cross-modal representation learning aims to learn a shared representation space where data from multiple modalities can be effectively compared, fused, and understood. This paper investigates whether increasing the diversity of the similarity score matrix improves the performance of CLIP (Contrastive Language-Image Pretraining), a multi-modal learning model that establishes a connection between images and text within a joint embedding space. Two transformation approaches, sine and sigmoid (including two versions), are incorporated into the CLIP model to amplify larger values and diminish smaller values within the similarity matrix (logits). Hardware limitations are addressed by using a more compact text encoder (DistilBERT) and a pre-trained ResNet50 image encoder. The proposed adaptations are evaluated on image classification and image/text retrieval tasks using 10 benchmark datasets, including Food101, Flickr30k, and COCO. The performance of the adapted models is compared to the base CLIP model using Accuracy, mean per class, and Recall@k metrics. The results demonstrate improvements over the baseline (CLIP) in Accuracy (up to 5.32% for the PatchCamelyon dataset), mean per class (up to 14.48% for the FGVCAircraft dataset), and retrieval precision (up to 45.20% in Recall@1 for the COCO dataset).	en_US
dc.language.iso	eng	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.title	Image-Text Connection: Exploring the Expansion of the Diversity Within Joint Feature Space Similarity Score	en_US
dc.title.alternative	Image-Text Connection: Exploring the Expansion of the Diversity Within Joint Feature Space Similarity Score	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	publishedVersion	en_US
dc.rights.holder	© 2023 The Authors.	en_US
dc.source.pagenumber	123209-123222	en_US
dc.source.volume	11	en_US
dc.source.journal	IEEE Access	en_US
dc.identifier.doi	10.1109/ACCESS.2023.3327339
dc.identifier.cristin	2195034
dc.relation.project	NILU: 120132	en_US
dc.relation.project	NILU: 121128	en_US
dc.relation.project	EC/H2020: 101037648	en_US
dc.relation.project	EEA and Norway Grants: 2019/35/J/HS6/03992	en_US
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	1
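
The abstract above describes transforming the image-text similarity matrix so that larger values are amplified and smaller values diminished, and evaluating retrieval with Recall@k. The following is a minimal illustrative sketch only, not the authors' implementation: this record does not give the paper's formulas, so the function `sigmoid_sharpen`, its `steepness` and `midpoint` parameters, and the toy matrix are all hypothetical. It assumes the matching text for image i sits at column i.

```python
import math


def sigmoid_sharpen(sim, steepness=10.0, midpoint=0.5):
    """Hypothetical transform (an assumption, not the paper's formula):
    push scores above `midpoint` toward 1 and scores below it toward 0."""
    return [
        [1.0 / (1.0 + math.exp(-steepness * (s - midpoint))) for s in row]
        for row in sim
    ]


def recall_at_k(sim, k):
    """Fraction of queries (rows) whose matching item (same index)
    appears among the k highest-scoring columns."""
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


# Toy 3x3 image-to-text similarity matrix (made-up numbers).
sim = [
    [0.80, 0.30, 0.10],
    [0.40, 0.35, 0.60],
    [0.20, 0.10, 0.70],
]
sharp = sigmoid_sharpen(sim)

print("Recall@1 (raw):      ", recall_at_k(sim, 1))
print("Recall@1 (sharpened):", recall_at_k(sharp, 1))
print("Score spread raw vs sharpened:",
      max(map(max, sim)) - min(map(min, sim)),
      max(map(max, sharp)) - min(map(min, sharp)))
```

Because this particular sigmoid is monotonic, applying it to a fixed matrix widens the spread of the scores without changing their ranking, so Recall@k is unchanged in the toy example. Any gain of the kind reported in the abstract would therefore have to come from using the transformed logits during training, which this sketch does not attempt to reproduce.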

Associated file(s)

Filename: Mohhammadi_et_al_IEEE_Access_1 ...
Size: 1.815Mb
Format: PDF

This item appears in the following collection(s)

Publications from Cristin - NILU [1380]
Scientific publications [1116]
Scientific articles, chapters and monographs.

Unless otherwise stated, this item is licensed as Attribution-NonCommercial-NoDerivatives 4.0 International