A master's thesis from Aalborg University


Generative Adversarial Networks for Speech Processing

Author

Term

4th term

Publication year

2017

Abstract

Deep learning is widely used in computer vision, speech, and language processing because of its strong performance and flexibility. A more recent idea is the generative adversarial network (GAN), in which two models are trained against each other in a game-like setup to learn to produce realistic data. GANs have shown strong results on image tasks but remain less explored for speech. In this project, we examine what adversarial training can contribute to speech processing. We focus on two applications: speech enhancement (removing noise from recordings) and automatic speech generation (synthesizing speech signals). For speech enhancement, experiments show that our approach overall outperforms the classical short-time spectral amplitude minimum mean-square error method and is comparable to a deep neural network-based technique. For automatic speech generation, our models can produce plausible spectrograms (a visual time-frequency representation of sound), but some artifacts are audible when they are converted back to audio. We provide generated samples for a subjective evaluation of quality.
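The adversarial training described in the abstract pits a discriminator (which scores inputs as real or generated) against a generator (which tries to fool it). A minimal sketch of the standard GAN losses in NumPy, assuming sigmoid discriminator outputs; the sample values below are illustrative and not taken from the thesis:

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between discriminator output p and label y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative discriminator outputs (probability that the input is real).
d_real = np.array([0.9, 0.8])  # on real speech samples
d_fake = np.array([0.2, 0.3])  # on generated samples

# The discriminator tries to label real data as 1 and generated data as 0.
d_loss = bce(d_real, 1).mean() + bce(d_fake, 0).mean()

# The generator (non-saturating form) tries to make the
# discriminator label its outputs as real.
g_loss = bce(d_fake, 1).mean()

print(round(d_loss, 3), round(g_loss, 3))  # → 0.454 1.407
```

In training, these two losses are minimized in alternation, so improvements in one model pressure the other to improve, which is the game-like setup the abstract refers to.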

[This abstract was generated with the help of AI]