A master's thesis from Aalborg University

Speech Enhancement and Noise-Robust Automatic Speech Recognition: Harvesting the Best of Two Worlds

Authors

;

Term

4th term

Publication year

2015

Pages

144

Abstract

This thesis examines whether there is a link between how noise reduction algorithms perform in two settings: automatic speech recognition (teaching computers to understand speech) and speech enhancement (making speech sound clearer to listeners). It introduces core ideas about speech production and hearing, and explains key tools such as Mel-frequency cepstral coefficients (MFCCs) to describe sound and hidden Markov models (HMMs) for recognition. The study focuses on the ETSI Advanced Front-End (AFE), a standard feature-extraction method, and compares its performance with modern speech enhancement algorithms using speech data from the Aurora-2 database. A main finding is that how aggressively noise is removed distinguishes the algorithms in the two fields; tuning this aggressiveness can improve results when an algorithm is used for the other task. The work also builds logistic models (statistical models) that estimate ETSI AFE recognition performance from objective measures of speech quality and intelligibility. The most accurate predictor was based on the short-time objective intelligibility measure and a recognizer trained on both clean and noisy speech.

[This abstract was generated with the help of AI]
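
The abstract mentions logistic models that estimate recognition performance from objective measures. As a rough illustration only, the sketch below maps a score d between 0 and 1 (such as an intelligibility measure) to a predicted word accuracy in percent. The functional form 100 / (1 + exp(a*d + b)), the names `logistic` and `fit_logistic`, the fitting procedure, and all numbers are assumptions made for this example; they are not taken from the thesis.

```python
import math


def logistic(d, a, b):
    """Map an objective score d to a predicted word accuracy in percent."""
    return 100.0 / (1.0 + math.exp(a * d + b))


def fit_logistic(scores, accuracies):
    """Estimate a and b by least squares on logit-transformed accuracies.

    Every accuracy must lie strictly between 0 and 100, because the
    transform ln(100 / acc - 1) is undefined at the end points.
    """
    ys = [math.log(100.0 / acc - 1.0) for acc in accuracies]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic illustration: hypothetical parameters, not results from the thesis.
TRUE_A, TRUE_B = -12.0, 7.0
scores = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
accuracies = [logistic(d, TRUE_A, TRUE_B) for d in scores]

a_hat, b_hat = fit_logistic(scores, accuracies)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

With a negative `a`, the predicted accuracy rises as the score rises. Fitting on the logit scale keeps the example free of external dependencies; a nonlinear least-squares fit on the accuracy scale would be a reasonable alternative when measured accuracies are noisy or close to 0 or 100 percent.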