The effect of different time-embeddings on an end-to-end diffusion speech enhancement model
Author
Term
4th semester
Education
Publication year
2024
Submitted on
2024-05-30
Abstract
This research proposes an end-to-end diffusion model for speech enhancement that is trained directly on raw audio waveforms. The aim is to reach performance comparable to existing methods that rely on Short-Time Fourier Transform (STFT) representations. The model uses a U-Net architecture with a time-step embedding. The embedding is based on an existing technique, applied here in a novel way to speech enhancement within a diffusion framework. It tells the model where it is in the diffusion process, which can improve performance. The results show that the time-step embedding is a key factor and significantly improves the model's capabilities. However, the model's performance remains below that of current state-of-the-art speech enhancement methods such as SGMSE and Facebook Demucs.
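As an illustration of what a time-step embedding can look like, the sketch below implements the sinusoidal embedding widely used in diffusion models. The abstract does not say which embeddings were compared, so this is an assumed example rather than the thesis's own method; the function name `sinusoidal_time_embedding` and the parameters `dim` and `max_period` are hypothetical.

```python
import numpy as np


def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map diffusion steps t (shape [batch]) to vectors of shape [batch, dim].

    Assumed, illustrative embedding; not taken from the thesis.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying towards 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    # One angle per (step, frequency) pair.
    angles = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    # First half holds the sines, second half the cosines.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


# Three example steps embedded into 8 dimensions.
emb = sinusoidal_time_embedding([0, 10, 999], dim=8)
```

In a U-Net, a vector like this is typically passed through a small MLP and added to the feature maps of each block, so that every layer is conditioned on the current diffusion step.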
Keywords
diffusion; AI; speech; enhancement