A Variational Autoencoder Approach for Representation and Transformation of Sounds
Author
Lionello, Matteo
Term
4th term
Education
Publication year
2018
Abstract
This thesis investigates how variational autoencoders (VAEs) can learn compact, class-aware representations of sounds and support transformation and generation of audio. Short waveforms of 4096 samples (approximately 512 ms at 8 kHz) are compressed 32x into 20 latent features that are used to reconstruct the input. The latent space is shaped to cluster items of the same class into sub-regions, enabling controlled morphing between regions corresponding to different sound types. To achieve this, the model combines a reconstruction loss regularized by a variational term with an auxiliary classification loss applied at the bottleneck. Three VAE architectures—convolutional, dilated, and hybrid—are compared. Evaluation focuses on (i) the organization of instances in the bottleneck via 1-Nearest Neighbour classification and (ii) reconstruction consistency via Dynamic Time Warping on MFCC features. A musical application with drum samples illustrates practical use. The excerpt outlines the approach and evaluation setup; specific quantitative findings are not provided here.
[This abstract has been generated with the help of AI directly from the project's full text]
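As an illustration of the training objective described in the abstract (a reconstruction loss regularized by a variational term, plus an auxiliary classification loss at the bottleneck), the following is a minimal sketch and not the thesis implementation. The choice of mean squared error for reconstruction, the Gaussian KL term against a standard normal prior, the weights `beta` and `gamma`, and all function names are assumptions made for this example; the abstract does not specify them.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I),
    summed over the latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def mean_squared_error(x, x_hat):
    """Reconstruction term (assumed here to be MSE over waveform samples)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def cross_entropy(logits, label):
    """Classification term: negative log-softmax of the true class,
    computed with the max-shift trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def combined_loss(x, x_hat, mu, log_var, logits, label, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; beta and gamma are hypothetical weights."""
    return (mean_squared_error(x, x_hat)
            + beta * kl_to_standard_normal(mu, log_var)
            + gamma * cross_entropy(logits, label))

if __name__ == "__main__":
    x = [0.0] * 4096          # one waveform of 4096 samples, as in the abstract
    x_hat = [0.01] * 4096     # a hypothetical reconstruction
    mu = [0.1] * 20           # 20 latent features, as in the abstract
    log_var = [0.0] * 20
    logits = [2.0, 0.5, -1.0] # hypothetical scores for three sound classes
    print(combined_loss(x, x_hat, mu, log_var, logits, label=0))
```

In this sketch the classification term is what would pull items of the same class toward a shared sub-region of the latent space, while the KL term keeps the latent distribution close to the prior so that points between regions remain decodable.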
Keywords
