TY - JOUR
T1 - Semantic deep learning and adaptive clustering for handling multimodal multimedia information retrieval
AU - Sattari, Saeid
AU - Yazici, Adnan
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2024
Y1 - 2024
AB - Multimedia data encompasses various modalities, including audio, visual, and textual content, necessitating robust retrieval methods that can harness these modalities to extract and retrieve semantic information from multimedia sources. This paper presents a scalable, versatile end-to-end framework for multimodal multimedia information retrieval. The core strength of the system lies in its capacity to learn semantic contexts both within individual modalities and across modalities, using deep neural models trained on combinations of queries and relevant shots obtained from query logs. A distinguishing feature of the framework is its ability to create shot templates that represent previously unseen videos; to enhance retrieval performance, the system applies clustering techniques to retrieve shots similar to these templates. To address the inherent uncertainty in multimodal concepts, an improved variant of fuzzy clustering is employed. In addition, a fusion method based on an ordered weighted averaging (OWA) operator is introduced, which uses various measures to aggregate the ranked lists produced by multiple retrieval systems. The proposed approach leverages parallel processing and transfer learning to extract features from three distinct modalities, ensuring the adaptability and scalability of the framework. To assess its effectiveness and efficiency, the system is evaluated through experiments on six widely recognized multimodal datasets; it outperforms previous studies in the literature on four of these datasets, with performance improvements ranging from 1.5% to 10.1% over the best results reported in those studies. The experimental findings, substantiated by statistical tests, establish the effectiveness of the proposed approach in the field of multimodal multimedia information retrieval.
KW - Adaptive fuzzy clustering
KW - Deep semantic learning
KW - Information fusion
KW - Multimodal multimedia retrieval
KW - Ranked lists fusion
UR - http://www.scopus.com/inward/record.url?scp=85194492441&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85194492441&partnerID=8YFLogxK
U2 - 10.1007/s11042-024-19312-7
DO - 10.1007/s11042-024-19312-7
M3 - Article
AN - SCOPUS:85194492441
SN - 1380-7501
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
ER -