AVT2-DWF: Improving Deepfake Detection Through Audio-Visual Fusion
The paper introduces AVT2-DWF, a framework that combines Audio-Visual dual Transformers with Dynamic Weight Fusion. It aims to improve deepfake detection as forgery methods shift from manipulating a single modality to fusing multiple modalities. Summary points:
- Dual transformers capture spatial and temporal dynamics.
- Uses a face transformer with an n-frame-wise tokenization strategy.
- Incorporates audio transformer encoders.
- Employs dynamic weight fusion to combine information from the audio and visual modalities.
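To make the fusion idea concrete, here is a minimal sketch of one common way to weight two modality embeddings dynamically: a softmax gate over per-modality scores, followed by a weighted sum. This summary does not describe the paper's actual fusion module, so everything here is an assumption for illustration: the function names `softmax` and `dynamic_weight_fusion`, the linear gate vectors `gate_audio` and `gate_visual`, and the choice of a weighted sum are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight_fusion(audio_feat, visual_feat, gate_audio, gate_visual):
    """Fuse two same-length embeddings with input-dependent weights.

    Hypothetical sketch, not the paper's module: each modality gets a
    scalar score from a linear gate (dot product with a gate vector),
    the two scores are normalised with softmax, and the embeddings are
    combined as a weighted sum.
    """
    if len(audio_feat) != len(visual_feat):
        raise ValueError("audio and visual embeddings must share a dimension")

    # Scalar relevance score per modality (assumed linear gate).
    score_a = sum(w * x for w, x in zip(gate_audio, audio_feat))
    score_v = sum(w * x for w, x in zip(gate_visual, visual_feat))

    # Weights depend on the inputs and always sum to 1.
    w_a, w_v = softmax([score_a, score_v])

    fused = [w_a * a + w_v * v for a, v in zip(audio_feat, visual_feat)]
    return fused, (w_a, w_v)


# Example call with toy 2-dimensional embeddings and arbitrary gate vectors.
fused, weights = dynamic_weight_fusion(
    audio_feat=[1.0, 2.0],
    visual_feat=[3.0, 4.0],
    gate_audio=[0.1, 0.1],
    gate_visual=[0.2, 0.0],
)
```

In a trained model the gate vectors would be learned parameters, so the balance between audio and visual evidence could change from one clip to the next instead of being fixed in advance.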