A master's thesis from Aalborg University


Automatic Speech Recognition-Driven Speech Enhancement Deep Neural Network

Translated title

Automatic Speech Recognition-Driven Speech Enhancement Deep Neural Network

Author

Term

4th term

Publication year

2023

Pages

57

Abstract

Speech enhancement systems aim to make noisy speech clearer and easier to understand by reducing background noise. However, traditional training objectives such as mean squared error (MSE) do not match how people perceive speech, and systems trained with them often perform poorly in subjective listening tests. When objective metrics are used directly as loss functions (rules that tell the model how wrong it is), systems score better on objective tests, but these gains do not always translate into better intelligibility for human listeners. This thesis explores using an automatic speech recognition (ASR) system as part of the loss function of a speech enhancement model. The ASR system provides a training signal based on word error rate (WER), the proportion of words that are recognized incorrectly, so the enhancer is optimized to produce speech that is easier to recognize. The goal is to narrow the gap between objective scores and human listening, with the hypothesis that reducing WER on noisy speech will improve intelligibility in both subjective and objective evaluations.
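
As a rough illustration of the WER metric mentioned in the abstract, the sketch below computes it in the standard way: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example and are not taken from the thesis. WER computed this way is not differentiable, and the abstract does not say how the thesis turns it into a training signal, so this shows only the metric, not the loss.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption and not the thesis's preprocessing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (classic Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.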

[This abstract has been rewritten with the help of AI based on the project's original abstract]