Filter Bank Effects on Detecting Adversarial Speech Recognition Attacks In Noise
Author
Nielsen, Christian Heider
Term
4th term
Publication year
2021
Abstract
This thesis examines how the choice of filter bank in cepstral feature extraction affects the detection of adversarial attacks against automatic speech recognition (ASR) systems, particularly under noisy conditions. Building on standard signal processing steps (pre-emphasis, framing, windowing, FFT, and DCT), it compares different filter banks and the cepstral features derived from them, including linear, Mel, inverse Mel, and gammatone (ERB), within classification models trained to distinguish legitimate utterances from structured attacks. Both white-box and black-box attacks are considered as threat models, but the core aim is to identify the most suitable feature space for detection rather than to differentiate attack types. Findings in this excerpt indicate that filter banks emphasizing higher frequencies over lower ones yield better detection performance than those with the opposite or uniform emphasis. An important exception arises when models are optimized under high-frequency background noise (e.g., clattering dishes and cutlery), where this advantage diminishes markedly. The study suggests that choosing feature scales that are not tailored solely to human perception, and explicitly accounting for noise profiles, can enhance ASR robustness against hidden adversarial inputs.
This thesis investigates how the choice of filter bank in the extraction of cepstral speech features affects the detection of adversarial attacks against automatic speech recognition systems, particularly in noisy environments. Starting from classical signal processing steps (pre-emphasis, framing, windowing, FFT, and DCT), it compares different filter banks and the cepstral representations derived from them, including linear, Mel, inverse Mel, and gammatone (ERB), in classification models trained to distinguish legitimate utterances from structured attacks. Both white-box and black-box attacks are treated as relevant threat models, while the focus is on finding the most suitable feature space for detection rather than on attack type. The results in this excerpt indicate that filter banks which weight higher frequencies more strongly than lower ones give better classification performance for attack detection than filter banks with the opposite or an equal weighting. An important exception arises when classification models are optimized under background noise dominated by high-frequency components (e.g., clinking plates and cutlery), where the advantage is markedly reduced. The study indicates that choosing filter bank scales that are not exclusively human-centered, together with explicit consideration of noise profiles, can strengthen the robustness of ASR systems against hidden attacks.
[This abstract has been generated with the help of AI directly from the project full text]
