The effect of different time-embeddings on an end-to-end diffusion speech enhancement model
Author
Jensen, Magnus Munk
Term
4th semester
Education
Publication year
2024
Abstract
Speech enhancement aims to make noisy recordings clearer. This work explores an end-to-end diffusion model that learns directly from raw audio waveforms instead of using a Short-Time Fourier Transform (STFT), a time–frequency representation. Diffusion models improve a signal by gradually removing noise over many steps. Our model uses a U-Net architecture and a time-step embedding that supplies information about which stage of the denoising process the network is in. The embedding technique is adapted from prior work but applied in a new way to speech enhancement within a diffusion framework. Results show that adding the time-step embedding is a key factor that significantly boosts the model’s capability. However, overall performance still falls short of current state-of-the-art systems, including SGMSE and Facebook Demucs.
[Danish abstract] Speech enhancement is about making noisy recordings clearer. In this project we investigate an end-to-end diffusion model that learns directly from raw audio signals instead of using a Short-Time Fourier Transform (STFT), which is a time–frequency representation. Diffusion models gradually improve a signal by removing noise step by step. Our model uses a U-Net architecture and a time-step embedding that gives the network information about which step of the gradual denoising it is at. The embedding technique is known from other contexts but is applied here in a new way to speech enhancement within a diffusion framework. The results show that adding the time-step embedding is a key factor that markedly improves the model's capabilities. At the same time, performance still falls short of the current state of the art, including methods such as SGMSE and Facebook Demucs.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
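The abstract describes a time-step embedding that tells the network which stage of the denoising process it is in. The thesis compares several embedding variants, and the exact one used is not stated here; as a minimal sketch, the sinusoidal embedding popularized by Transformers and commonly used in diffusion models (e.g. DDPM) looks like this. The function name and dimensions are illustrative assumptions, not the thesis's code:

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal time-step embedding (Transformer-style, common in diffusion models).

    Hypothetical sketch: maps a scalar diffusion step t to a `dim`-dimensional
    vector of sines and cosines at geometrically spaced frequencies, so the
    U-Net can condition on how far along the denoising process it is.
    """
    half = dim // 2
    # Frequencies decay geometrically from 1 down to 1/10000.
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

# Example: embed diffusion step 50 into an 8-dimensional vector.
emb = timestep_embedding(50, dim=8)
```

In practice such a vector is typically passed through a small MLP and added to intermediate U-Net feature maps, which is how the network receives the stage information mentioned above.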
Keywords
diffusion ; AI ; Speech ; Enhancement
