The effect of different time-embeddings on an end-to-end diffusion speech enhancement model
Author
Jensen, Magnus Munk
Term
4th semester
Education
Publication year
2024
Abstract
Speech enhancement aims to make noisy recordings clearer. This work explores an end-to-end diffusion model that learns directly from raw audio waveforms instead of using a Short-Time Fourier Transform (STFT), a time–frequency representation. Diffusion models improve a signal by gradually removing noise over many steps. Our model uses a U-Net architecture and a time-step embedding that supplies information about which stage of the denoising process the network is in. The embedding technique is adapted from prior work but applied in a new way to speech enhancement within a diffusion framework. Results show that adding the time-step embedding is a key factor that significantly boosts the model’s capability. However, overall performance still falls short of current state-of-the-art systems, including SGMSE and Facebook Demucs.
[Danish abstract] Speech enhancement is about making noisy recordings clearer. In this project we investigate an end-to-end diffusion model that learns directly from raw audio signals instead of using a Short-Time Fourier Transform (STFT), which is a time–frequency representation. Diffusion models gradually improve a signal by removing noise step by step. Our model uses a U-Net architecture and a time-step embedding that gives the network information about which step of the gradual denoising it is at. The embedding technique is known from other contexts but is applied here in a new way to speech enhancement within a diffusion framework. The results show that adding the time-step embedding is a key factor that markedly improves the model's capabilities. At the same time, performance still falls short of the current state of the art, including methods such as SGMSE and Facebook Demucs.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
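The abstract describes a time-step embedding that tells the network which stage of the denoising process it is in. The thesis compares several embedding variants, and the exact one used is not stated here; as a minimal sketch, the sinusoidal embedding popularized by Transformers and commonly used in diffusion models (e.g. DDPM) looks like this. The function name and dimensions are illustrative assumptions, not the thesis's code:

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal time-step embedding (Transformer-style, common in diffusion models).

    Hypothetical sketch: maps a scalar diffusion step t to a `dim`-dimensional
    vector of sines and cosines at geometrically spaced frequencies, so the
    U-Net can condition on how far along the denoising process it is.
    """
    half = dim // 2
    # Frequencies decay geometrically from 1 down to 1/10000.
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

# Example: embed diffusion step 50 into an 8-dimensional vector.
emb = timestep_embedding(50, dim=8)
```

In practice such a vector is typically passed through a small MLP and added to intermediate U-Net feature maps, which is how the network receives the stage information mentioned above.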
Keywords
diffusion ; AI ; Speech ; Enhancement
