A master's thesis from Aalborg University

Real-time implementation considerations of a deep learning based Voice Activity Detection

Author:
Term: 4th term
Publication year: 2022
Submitted on:
Pages: 100

Abstract

This thesis, conducted in collaboration with RTX, examines a deep learning based method for Voice Activity Detection (VAD), the technology that decides when speech is present in an audio signal. The goal is to run the method in real time on embedded devices with limited computing resources. We address three questions: how to improve the VAD's accuracy, how to reduce its algorithmic delay (the time from audio input to decision), and how to adapt the method to resource-constrained hardware. As part of the work, a paper proposing a way to improve accuracy and reduce delay was submitted to Interspeech 2022. Accuracy is improved by applying adversarial multi-task learning during training, and delay is reduced by using smaller filters in the neural network, at the cost of a small drop in accuracy. We then evaluate pruning and quantization, techniques that shrink a model by removing less important parameters and by storing numbers at lower precision, for this use case. Finally, we discuss which hardware architectures are best suited for implementing the optimized algorithm.
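To make the deployment step concrete, below is a minimal sketch of a pruning-plus-quantization pipeline in PyTorch. It is not the thesis's implementation: the TinyVAD model, its layer sizes, the 40-dimensional input features, the 50% sparsity target, and the choice of dynamic int8 quantization are all illustrative assumptions.

```python
# Hypothetical sketch: magnitude pruning followed by dynamic int8
# quantization of a toy frame-level VAD classifier. Illustrative only;
# the thesis's actual model and settings may differ.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

class TinyVAD(nn.Module):
    """Toy per-frame speech/non-speech classifier (assumed architecture)."""
    def __init__(self, n_features=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 2),  # logits: [non-speech, speech]
        )

    def forward(self, x):
        return self.net(x)

model = TinyVAD()

# 1) Pruning: zero out the 50% smallest-magnitude weights in each
#    linear layer (unstructured L1 pruning), then bake the mask in.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the zeros permanent

# 2) Quantization: store linear-layer weights as int8 and dequantize
#    on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

frame = torch.randn(1, 40)  # one feature frame (e.g. log-mel energies)
print(quantized(frame))     # speech/non-speech logits
```

Note that unstructured pruning as sketched here only sets weights to zero in a dense tensor; on an embedded target the memory and compute savings are realized only with sparse storage or structured pruning, whereas int8 quantization shrinks weight storage roughly fourfold immediately.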

[This summary has been rewritten with the help of AI based on the project's original abstract]