Speech Enhancement and Deep Learning Speaker Separation: Separation, Identification, and Enhancement of a Conversational Partner in a Cocktail Party Environment
Authors
Nielsen, Rasmus ; Andersen, Morten ; Busk, Jonas Kronborg
Term
4th term
Education
Publication year
2023
Submitted on
2023-06-01
Pages
79
Abstract
Many people find it hard to follow a single talker in places full of chatter and music, a challenge known as the cocktail party effect. This report explores a speech enhancement system that aims to isolate the user's conversational partner. The proposed pipeline has three steps: (1) speech separation, which splits a noisy mixture into individual voices; (2) speaker ranking, which decides which voice is most relevant; and (3) speech enhancement, which makes that voice clearer. The work builds on the newly proposed Minimum Overlap-Gap algorithm, which handles the ranking and enhancement steps by selecting and improving the target speaker. Because the separation step offers many design choices, we focus on a single-microphone approach based on deep learning. After surveying current methods, we investigate two leading architectures, Convolutional TasNet and the Dual-Path Recurrent Neural Network, as examples of deep learning models for audio separation. We train and test these models on mixtures of two, three, and four speakers and explore techniques that might improve separation performance. Several of the resulting models show promise for making the target speaker's voice more intelligible.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
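To make the three-step pipeline concrete, here is a minimal, hypothetical Python sketch of its structure. The function names (`separate`, `rank_speakers`, `enhance`) and their bodies are placeholders of our own naming, not the report's implementation: a real system would use a trained Conv-TasNet or DPRNN model for the separation step and the Minimum Overlap-Gap algorithm for ranking and enhancement, where this sketch substitutes an identity "model", an energy-based ranking, and a fixed gain so the example runs end to end.

```python
# Minimal sketch of the pipeline: separation -> speaker ranking -> enhancement.
# All three steps below are placeholders; see the lead-in text for what a
# real system would use instead.
import numpy as np

def separate(mixture: np.ndarray, n_speakers: int) -> list[np.ndarray]:
    """Placeholder for a trained separation model (e.g. Conv-TasNet or DPRNN).

    A real model maps one single-microphone mixture waveform to n_speakers
    estimated source waveforms; here we return copies so the sketch runs.
    """
    return [mixture.copy() for _ in range(n_speakers)]

def rank_speakers(sources: list[np.ndarray]) -> int:
    """Placeholder ranking: pick the loudest source by RMS energy.

    The report instead ranks candidates with the Minimum Overlap-Gap
    algorithm to identify the user's conversational partner.
    """
    energies = [np.sqrt(np.mean(s ** 2)) for s in sources]
    return int(np.argmax(energies))

def enhance(source: np.ndarray, gain_db: float = 6.0) -> np.ndarray:
    """Placeholder enhancement: apply a fixed gain to the selected source."""
    return source * 10 ** (gain_db / 20)

# Toy usage: a 1-second, 8 kHz synthetic "mixture" of two sinusoids.
sr = 8000
t = np.arange(sr) / sr
mixture = 0.6 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

sources = separate(mixture, n_speakers=2)   # step 1: speech separation
target_idx = rank_speakers(sources)         # step 2: speaker ranking
enhanced = enhance(sources[target_idx])     # step 3: speech enhancement
print(f"selected source {target_idx}, peak amplitude {np.abs(enhanced).max():.2f}")
```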
Keywords
Neural Networks ; Speech Separation ; Speech Enhancement ; Speaker Identification ; RNN ; NN ; TasNet ; DPRNN ; AI ; Deep Learning ; Machine Learning
