Sharp Multiple Instance Learning for DeepFake Video Detection
This paper introduces a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated, and proposes a sharp MIL (S-MIL), which builds a direct mapping from instance embeddings to bag prediction, rather than from instance embeddings to instance prediction and then to bag prediction as in traditional MIL.
Abstract
With the rapid development of facial manipulation techniques, face forgery has received considerable attention in the multimedia and computer vision communities due to security concerns. Existing methods are mostly designed for single-frame detection trained with precise image-level labels, or for video-level prediction by only modeling the inter-frame inconsistency, leaving potentially high risks from DeepFake attackers. In this paper, we introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated. We address this problem with a multiple instance learning (MIL) framework, treating faces and the input video as instances and bag, respectively. A sharp MIL (S-MIL) is proposed which builds a direct mapping from instance embeddings to bag prediction, rather than from instance embeddings to instance prediction and then to bag prediction as in traditional MIL. Theoretical analysis proves that the gradient vanishing in traditional MIL is relieved in S-MIL. To generate instances that can accurately incorporate the partially manipulated faces, a spatial-temporal encoded instance is designed to fully model the intra-frame and inter-frame inconsistency, which further improves detection performance. We also construct a new dataset, FFPMS, for partially attacked DeepFake video detection, which can benefit the evaluation of different methods at both frame and video levels. Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection. In addition, S-MIL can also be adapted to traditional DeepFake image detection tasks and achieves state-of-the-art performance on single-frame datasets.
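The abstract contrasts two ways of getting from instances (faces) to a bag (video) prediction but does not state the S-MIL formula. The sketch below only illustrates that contrast under stated assumptions. The "traditional" path scores each instance and max-pools the instance probabilities. The "direct" path pools the embeddings with softmax attention weights and classifies the pooled embedding once. The attention pooling, the function names, and the parameters `w`, `b`, and `v` are hypothetical stand-ins chosen for illustration, not the paper's formulation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def traditional_mil(embeddings, w, b):
    """Instance embeddings -> instance predictions -> bag prediction.

    Each face is classified on its own; the bag (video) probability is the
    maximum instance probability (max pooling, one common MIL choice).
    """
    instance_probs = [sigmoid(dot(w, h) + b) for h in embeddings]
    return max(instance_probs), instance_probs


def direct_mil(embeddings, w, b, v):
    """Instance embeddings -> bag prediction, with no instance predictions.

    Assumption: softmax attention weights (scored by a hypothetical vector v)
    pool the embeddings into one bag embedding, which is classified once.
    """
    scores = [dot(v, h) for h in embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    bag_embedding = [
        sum(a * h[k] for a, h in zip(weights, embeddings)) for k in range(dim)
    ]
    return sigmoid(dot(w, bag_embedding) + b), weights


# Toy bag: three face embeddings, only the last one "manipulated".
faces = [[0.1, 0.0], [0.0, 0.2], [2.0, 1.5]]
w, b, v = [1.0, 1.0], -1.0, [1.0, 1.0]

bag_prob_traditional, face_probs = traditional_mil(faces, w, b)
bag_prob_direct, attention = direct_mil(faces, w, b, v)
print(bag_prob_traditional, bag_prob_direct, attention)
```

In the toy bag, the attention weights concentrate on the third face, so the video-level prediction is driven by the one manipulated instance even though no face-level label is used.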