A master's thesis from Aalborg University


The effect of different time-embeddings on an end-to-end diffusion speech enhancement model

Author

Term

4th semester

Publication year

2024

Abstract

Speech enhancement aims to make noisy recordings clearer. This work explores an end-to-end diffusion model that learns directly from raw audio waveforms instead of using a Short-Time Fourier Transform (STFT), a time–frequency representation. Diffusion models improve a signal by gradually removing noise over many steps. Our model uses a U-Net architecture and a time-step embedding that supplies information about which stage of the denoising process the network is in. The embedding technique is adapted from prior work but applied in a new way to speech enhancement within a diffusion framework. Results show that adding the time-step embedding is a key factor that significantly boosts the model’s capability. However, overall performance still falls short of current state-of-the-art systems, including SGMSE and Facebook Demucs.
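As a rough illustration of what a time-step embedding does, the sketch below maps a diffusion step index to a vector and injects it into a waveform feature map. The abstract does not specify which embeddings the thesis compares, so the sinusoidal form here is an assumption (it is a common choice in diffusion models), and the function names, the `max_period` parameter, and the additive conditioning are hypothetical, not the thesis's implementation.

```python
import numpy as np


def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map diffusion step indices to sinusoidal vectors of size `dim` (even).

    Assumed form: sin/cos pairs at geometrically spaced frequencies.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))  # shape (batch,)
    half = dim // 2
    # Frequencies decay from 1 towards 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]  # shape (batch, half)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)


def condition_features(features, emb, weight):
    """Add a per-channel bias derived from the embedding.

    features: (batch, channels, samples) raw-waveform feature map
    emb:      (batch, dim) time-step embedding
    weight:   (dim, channels) projection (learned in a real model)
    """
    bias = emb @ weight  # shape (batch, channels)
    return features + bias[:, :, None]  # broadcast over the sample axis


# Usage: two items in a batch at different denoising steps.
rng = np.random.default_rng(0)
emb = sinusoidal_time_embedding([0, 10], dim=8)
features = rng.standard_normal((2, 4, 16))
weight = rng.standard_normal((8, 4))
conditioned = condition_features(features, emb, weight)
```

In a U-Net, a projection like this would typically be applied inside each block, so that every resolution level receives the step information.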

[This abstract has been rewritten with the help of AI based on the project's original abstract]