A Structured Survey of Attention Mechanisms in Audio-Visual Fusion: Architectures, Challenges, and Evaluation Frameworks

Authors

  • Rexcharles Enyinna Donatus, National Open University of Nigeria, Nigeria
  • Oludele Awodele, Babcock University, Nigeria
  • Osondu Everestus Oguike, University of Nigeria, Nigeria
  • Amina Sambo-Magaji, National Information Technology Development Agency, Nigeria

DOI:

https://doi.org/10.64539/sjcs.v2i2.2026.438

Keywords:

Multimodal fusion, Audio-visual, Deep learning, Attention mechanisms, Temporal modeling, Cross-modal attention

Abstract

Audio-visual fusion plays an important role in multimodal artificial intelligence, particularly in applications such as speech processing, emotion recognition, and video understanding, where information from sound and vision improves performance and contextual understanding. Recent developments are driven by attention mechanisms and transformer-based models, which enable more flexible and context-aware interaction within and across modalities compared to conventional fusion approaches. Despite these advances, challenges remain, including sensitivity to noisy or missing modalities, modality imbalance, limited interpretability, and high computational cost. This paper presents a structured survey of attention mechanisms in audio-visual fusion, with emphasis on architectural design and evaluation practices across multiple application domains. A structured survey methodology inspired by PRISMA principles is used to identify and select relevant studies, followed by comparative analysis of model architectures, training strategies, and evaluation methods. The findings show that transformer-based and attention-centered architectures have become increasingly prominent and achieve strong performance across tasks. However, these approaches involve trade-offs between robustness, interpretability, and computational efficiency, and remain sensitive to noise and modality imbalance. Evaluation practices are also inconsistent, with limited use of standardized and robustness-focused metrics. The survey provides an attention-centered taxonomy of audio-visual fusion methods and synthesizes current approaches and evaluation strategies. It identifies key challenges and outlines directions for improving robustness, interpretability, and efficiency in practical deployment.

Published

2026-05-15

How to Cite

Donatus, R. E., Awodele, O., Oguike, O. E., & Sambo-Magaji, A. (2026). A Structured Survey of Attention Mechanisms in Audio-Visual Fusion: Architectures, Challenges, and Evaluation Frameworks. Scientific Journal of Computer Science, 2(2), 237–252. https://doi.org/10.64539/sjcs.v2i2.2026.438
