Abstract
This study introduces the first single-branch network designed to tackle a spectrum of biometric matching scenarios, including unimodal, multimodal, cross-modal, and missing-modality situations. Our method adapts the prototypical network loss to train concurrently on audio, visual, and thermal data within a unified multimodal framework. By converting all three modalities into image format, we employ the Vision Transformer (ViT) architecture with shared model parameters, enabling the encoder to map all input modalities into a unified vector space. The multimodal prototypical network loss ensures that vector representations of the same speaker remain close together regardless of their original modality. Evaluation on the SpeakingFaces and VoxCeleb datasets covers a wide range of scenarios and demonstrates the effectiveness of our approach. The trimodal model achieves an Equal Error Rate (EER) of 0.27% on the SpeakingFaces test split, surpassing all previously reported results. Moreover, after a single training run, it performs comparably to its unimodal and bimodal counterparts, including unimodal audio, visual, and thermal as well as audio-visual, audio-thermal, and visual-thermal configurations. In cross-modal evaluation on the VoxCeleb1 test set (audio versus visual), our approach yields an EER of 24.1%, again outperforming state-of-the-art models. This underscores the effectiveness of our unified model across diverse biometric verification scenarios.
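To illustrate the idea of a shared encoder trained with a multimodal prototypical loss, the sketch below shows a minimal PyTorch version. It is an assumption-laden toy, not the authors' implementation: `shared_encoder` is a stand-in for the shared-parameter ViT, and names such as `proto_loss`, `embed_dim`, and the tensor shapes are illustrative only. The key point it captures is that all modalities, once rendered as images, pass through one set of weights, and queries are classified by distance to per-speaker prototypes even when support and query come from different modalities.

```python
# Minimal sketch of a multimodal prototypical-network loss with a shared encoder.
# The encoder is a placeholder for the shared ViT; shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
image_size = 64  # spectrograms, visual faces, and thermal faces all rendered as images

# Stand-in for the shared ViT encoder: one set of weights for every modality.
shared_encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * image_size * image_size, embed_dim),
)

def proto_loss(support, support_labels, query, query_labels):
    """Prototypical loss: queries are classified by negative squared distance to
    per-speaker prototypes; support and query may come from different modalities."""
    z_support = shared_encoder(support)              # [Ns, embed_dim]
    z_query = shared_encoder(query)                  # [Nq, embed_dim]
    classes = support_labels.unique()
    # Prototype = mean embedding of each speaker's support samples.
    prototypes = torch.stack([z_support[support_labels == c].mean(0) for c in classes])
    dists = torch.cdist(z_query, prototypes) ** 2    # [Nq, n_speakers]
    # Map each query label to the index of its prototype.
    targets = torch.stack([(classes == y).nonzero(as_tuple=True)[0][0] for y in query_labels])
    return F.cross_entropy(-dists, targets)

# Toy cross-modal episode: thermal-face supports versus audio-spectrogram queries.
support_imgs = torch.randn(6, 3, image_size, image_size)  # 3 speakers x 2 thermal shots
support_ids = torch.tensor([0, 0, 1, 1, 2, 2])
query_imgs = torch.randn(3, 3, image_size, image_size)    # 1 spectrogram image per speaker
query_ids = torch.tensor([0, 1, 2])

loss = proto_loss(support_imgs, support_ids, query_imgs, query_ids)
loss.backward()
print(float(loss))
```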
| Original language | English |
| --- | --- |
| Pages (from-to) | 96729-96739 |
| Number of pages | 11 |
| Journal | IEEE Access |
| Volume | 12 |
| DOIs | |
| Publication status | Published - 2024 |
Keywords
- Biometric matching
- cross-modal matching
- face verification
- face-audio association
- metric learning
- multimodal verification
- speaker verification
- transformer
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering