Listening Beyond Words: Transfer Learning for Audio Deepfake Detection
Author
Bonvang, Gustav Arnt Palmelund
Term
4th semester
Education
Publication year
2023
Submitted on
2023-06-01
Pages
51
Abstract
Audio deepfakes—synthetic or manipulated speech that imitates a real voice—are becoming increasingly convincing, creating a need to detect them. This project investigates how to identify such forgeries in audio. A review of prior work indicates that residual neural networks (ResNets) and transfer learning (adapting a pretrained model to a new task) are promising for audio deepfake detection. Based on this, we propose an approach using the ResNet50 architecture with transfer learning. Models are trained and evaluated on the In-The-Wild dataset. Each audio clip is converted into a mel spectrogram—a picture-like representation of sound frequencies over time—and rescaled before being fed into the network. We train multiple models with different hyperparameters, including baseline models trained without transfer learning. The best model, which uses transfer learning, achieves 96.7% accuracy and a 95.5% F1-score (a single metric that balances precision and recall). Compared with the non-transfer baselines, transfer learning yields an average increase of 21.90% in accuracy and 44.01% in F1-score. These results suggest that combining ResNet50 with transfer learning is an effective approach to detecting audio deepfakes in this setting.
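The abstract reports both accuracy and F1-score, where F1 is the harmonic mean of precision and recall. As a minimal sketch of how these two metrics differ, the snippet below computes them from confusion-matrix counts; the counts used are invented for illustration and are not the project's results.

```python
# Illustrative only: accuracy and F1 computed from hypothetical
# confusion-matrix counts (not the counts from this project).

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (ignores true negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a fake-vs-real classifier ("fake" = positive class)
tp, tn, fp, fn = 90, 85, 10, 15
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.875
print(round(f1_score(tp, fp, fn), 3))      # 0.878
```

Because F1 ignores true negatives, it can diverge noticeably from accuracy on imbalanced data, which is why the abstract reports a larger transfer-learning gain in F1 (44.01%) than in accuracy (21.90%).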
[This abstract has been rewritten with the help of AI based on the project's original abstract]
Keywords
