Joint Audio-Visual Deepfake Detection

DOI: 10.1109/ICCV48922.2021.01453

149 Citations • 2021

Yipin Zhou, Ser-Nam Lim

2021 IEEE/CVF International Conference on Computer Vision (ICCV)

This work proposes a novel visual / auditory deepfake joint detection task and shows that exploiting the intrinsic synchronization between the visual and auditory modalities could benefit deepfake detection.

Abstract

Deepfakes ("deep learning" + "fake") are videos synthetically generated with AI algorithms. While they can be entertaining, they can also be misused to falsify speeches and spread misinformation. Creating deepfakes involves both visual and auditory manipulations. Research on detecting visual deepfakes has produced a number of detection methods as well as datasets, while audio deepfakes (e.g., synthetic speech from text-to-speech or voice conversion systems) and the relationship between the video and audio modalities have been relatively neglected. In this work, we propose a novel visual / auditory deepfake joint detection task and show that exploiting the intrinsic synchronization between the visual and auditory modalities could benefit deepfake detection. Experiments demonstrate that the proposed joint detection framework outperforms independently trained models and, at the same time, yields superior generalization capability on unseen types of deepfakes.
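
To make the idea of "exploiting synchronization" concrete, the toy sketch below scores how well two time-aligned feature sequences agree and combines that score with per-modality fake probabilities. It is not the paper's framework, which is learned jointly and whose architecture the abstract does not describe. The function names, the cosine-similarity measure, and both thresholds are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sync_score(visual_feats, audio_feats):
    """Mean per-time-step cosine similarity between two aligned feature sequences.

    Assumption: both sequences were already embedded into a shared space
    and resampled to the same number of time steps.
    """
    if not visual_feats or len(visual_feats) != len(audio_feats):
        raise ValueError("sequences must be non-empty and of equal length")
    sims = [cosine(v, a) for v, a in zip(visual_feats, audio_feats)]
    return sum(sims) / len(sims)


def joint_decision(p_video_fake, p_audio_fake, sync,
                   prob_threshold=0.5, sync_threshold=0.5):
    """Combine per-modality probabilities with the sync score.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    video_fake = p_video_fake >= prob_threshold
    audio_fake = p_audio_fake >= prob_threshold
    out_of_sync = sync < sync_threshold
    return {
        "video_fake": video_fake,
        "audio_fake": audio_fake,
        "out_of_sync": out_of_sync,
        "any_fake": video_fake or audio_fake or out_of_sync,
    }
```

In this sketch, a clip whose video and audio each look plausible in isolation can still be flagged when the two streams disagree over time. That is the kind of cross-modal cue that independently trained single-modality detectors cannot use.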