Deepfake Detection using Multimodal Machine Learning Techniques
DOI: https://doi.org/10.15662/IJARCST.2023.0602001

Keywords: Deepfake Detection, Multimodal Learning, Audio–Visual Consistency, Affective Cues, Temporal Feature Prediction, DeepFake-TIMIT, FakeAVCeleb

Abstract
Deepfake media—synthetic audiovisual content crafted using advanced AI models—pose serious threats to trust, security, and privacy. Traditional detection methods that focus solely on visual cues are increasingly circumvented by more sophisticated forgeries. Multimodal machine learning, which combines audio and video modalities, offers a more resilient detection framework. This review surveys pre-2022 research on multimodal deepfake detection, highlighting approaches that exploit both emotional inconsistency and temporal misalignment between modalities. For example, Mittal et al.'s affective-cue-based Siamese network achieves an AUC of 96.6% on the DeepFake-TIMIT dataset and 84.4% on DFDC by modeling emotion discrepancies across audio and visual channels [1]. Khalid et al. evaluate unimodal, ensemble, and multimodal detectors on the FakeAVCeleb dataset, observing that neither unimodal nor naive multimodal baselines consistently outperform ensemble methods [2]. Additional methods include temporal feature prediction, which leverages contrastive learning to detect synchronization mismatches between audio and visual sequences, achieving accuracy of ~84.3% and AUC of ~89.9% on FakeAVCeleb [3]. Survey studies reinforce that deepfake detection must span modalities to counter increasingly seamless forgeries [4]. We propose a general workflow: data collection → synchronized feature extraction → modality-specific embedding → cross-modal consistency modeling (e.g., contrastive, affective) → fusion and classification → evaluation. Advantages of multimodal frameworks include resilience to single-modality manipulation and the ability to model inter-modal anomalies; disadvantages include the complexity of aligning asynchronous content, limited multimodal datasets, and heavier computation. We conclude that multimodal detection adds robustness but remains under-explored. Future directions include better dataset creation, fine-grained artifact modeling, self-supervised pretraining, and leveraging large pretrained multimodal models.
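To make the proposed workflow concrete, the sketch below illustrates its middle stages (modality-specific embedding → cross-modal consistency modeling → fusion and classification) in PyTorch. It is a minimal illustration, not the implementation of any cited paper: the encoder architectures, embedding dimension, margin value, and feature sizes are assumptions chosen for readability.

```python
# Minimal sketch of the reviewed workflow: embed each modality, score their
# cross-modal consistency (contrastive-style), then fuse and classify.
# All architectural choices and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps pre-extracted per-frame features of one modality to a unit-norm embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> mean-pool over time, then embed and normalize
        return F.normalize(self.net(x.mean(dim=1)), dim=-1)


class AudioVisualDetector(nn.Module):
    """Embeds audio and video, scores their consistency, and classifies real vs. fake."""

    def __init__(self, audio_dim: int, video_dim: int, emb_dim: int = 128):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, emb_dim)
        self.video_enc = ModalityEncoder(video_dim, emb_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * emb_dim + 1, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, audio, video):
        a = self.audio_enc(audio)
        v = self.video_enc(video)
        # Cross-modal consistency: cosine similarity between the two embeddings.
        consistency = F.cosine_similarity(a, v, dim=-1).unsqueeze(-1)
        # Fusion: concatenate both embeddings with the consistency score, then classify.
        logit = self.classifier(torch.cat([a, v, consistency], dim=-1))
        return logit.squeeze(-1), consistency.squeeze(-1)


def training_loss(logit, consistency, label, margin: float = 0.5):
    """Classification loss plus a contrastive-style consistency term:
    real pairs (label 0) are pushed toward high audio-visual similarity,
    fake pairs (label 1) toward similarity below the margin."""
    bce = F.binary_cross_entropy_with_logits(logit, label.float())
    real_term = (1 - label.float()) * (1 - consistency)
    fake_term = label.float() * torch.clamp(consistency - margin, min=0)
    return bce + (real_term + fake_term).mean()


if __name__ == "__main__":
    # Toy forward/backward pass with random stand-in features.
    model = AudioVisualDetector(audio_dim=40, video_dim=512)
    audio = torch.randn(4, 100, 40)     # e.g. 100 frames of 40-dim audio features
    video = torch.randn(4, 100, 512)    # e.g. 100 frames of 512-dim face features
    label = torch.tensor([0, 1, 0, 1])  # 0 = real, 1 = fake
    logit, cons = model(audio, video)
    loss = training_loss(logit, cons, label)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In this sketch the consistency score plays the role of the cross-modal cue (emotional or temporal agreement in the cited works), while the classifier fuses it with the modality embeddings so the model can still fall back on single-modality artifacts when the two streams happen to agree.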
References
1. Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., & Manocha, D. (2020). Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues. arXiv preprint.
2. Khalid, H., Kim, M., Tariq, S., & Woo, S. S. (2021). Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors. arXiv preprint.
3. Temporal Feature Prediction in Audio–Visual Deepfake Detection. MDPI.
4. Deepfake detection across modalities and formats. PeerJ digital forensics survey (pre-2022 scope).
5. Other related multimodal feature fusion methods discussed via Moonlight review.


