Audio deepfakes, technically known as logical-access voice spoofing techniques, have become an increasing threat to voice interfaces due to recent breakthroughs in speech synthesis and voice conversion technologies. Effectively detecting these attacks is critical to many speech applications, including automatic speaker verification systems. As new types of speech synthesis and voice conversion techniques emerge rapidly, the generalization ability of spoofing countermeasures is becoming an increasingly critical challenge. This paper focuses on overcoming this issue by using the large margin cosine loss function (LMCL) and online frequency masking augmentation to force the neural network to learn more robust feature embeddings. We evaluate the performance of the proposed system on the ASVspoof 2019 logical access (LA) dataset. Additionally, we evaluate it on a noisy version of the ASVspoof 2019 dataset, using publicly available noises to simulate more realistic scenarios. Finally, we evaluate the proposed system on a copy of the dataset that is logically replayed through the telephony channel to simulate spoofing attacks in the call center scenario. Our baseline system is based on a residual neural network and achieved the lowest equal error rate (EER) of 4.04% among all single-system submissions during the ASVspoof 2019 challenge. Furthermore, the additional improvements proposed in this paper reduce the EER to 1.26%.
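The two techniques named above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's exact implementation: the function names, mask widths, and LMCL hyperparameters (scale `s`, margin `m`) are illustrative assumptions. `frequency_mask` zeroes out a random band of frequency bins on each training pass (the "online" part), and `lmcl_loss` implements the CosFace-style large margin cosine loss, i.e. cross-entropy over scaled cosine similarities with a margin subtracted from the target class.

```python
import numpy as np

def frequency_mask(spec, max_width=20, num_masks=1, rng=None):
    """Online frequency masking (illustrative defaults).

    spec: (freq_bins, time) spectrogram-like feature matrix.
    Zeroes `num_masks` random frequency bands of width <= max_width.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq = out.shape[0]
    for _ in range(num_masks):
        w = int(rng.integers(0, max_width + 1))          # random band width
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))  # random band start
        out[f0:f0 + w, :] = 0.0
    return out

def lmcl_loss(embeddings, class_weights, labels, s=30.0, m=0.35):
    """Large margin cosine loss (CosFace formulation).

    embeddings:    (batch, dim) feature embeddings.
    class_weights: (num_classes, dim) classifier weight vectors.
    Both are L2-normalized so logits are cosine similarities; the margin m
    is subtracted from the target-class cosine before scaling by s.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # (batch, num_classes)
    idx = np.arange(len(labels))
    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m)  # apply margin to target class
    # numerically stable cross-entropy over the margin-adjusted logits
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()
```

In training, the mask would be re-sampled for every example on every epoch, so the network never sees the same corrupted view twice; the margin in LMCL then forces genuine and spoofed embeddings apart in angular space rather than merely separating them with a linear boundary.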