A master's thesis from Aalborg University

Audio Music Generation using Deep Learning in an End-to-End Approach

Author

Term

4th term

Publication year

2018

Submitted on

Pages

72

Abstract

This thesis explores how to synthesize new sounds with deep learning in an end-to-end setup. In this approach, the system learns directly from raw audio and produces new audio samples without hand-crafted features or extra inputs. Sound synthesis has a long tradition, but a major shift came with WaveNet (2016), a neural network that generates audio one sample at a time while taking all previous samples into account. I design several network architectures for end-to-end audio generation and then focus on WaveNet for an in-depth study. After promising results with global conditioning—guiding the model with overall information such as the instrument type—I extend the model to use local conditioning, which provides time-varying guidance during generation. I study the benefits of local conditioning for control and sound generation. The final tool can automatically distinguish and generate specific piano and panflute sounds. It is guided by common audio descriptors, the mel spectrum and MFCCs (Mel-Frequency Cepstral Coefficients), which compactly represent how energy is distributed across perceptual frequency bands over time.
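As a reading aid, the gated activation with local conditioning that the abstract describes can be sketched in a few lines. The following PyTorch snippet is an illustrative sketch only, not the thesis's implementation; the class name GatedResidualBlock, the channel sizes, and the single-block layout are assumptions, and the conditioning stream c stands in for mel-spectrum or MFCC frames already upsampled to the audio sample rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    """Illustrative WaveNet-style block: a dilated causal convolution whose
    tanh/sigmoid gates are shifted by a time-varying conditioning signal
    (local conditioning). Names and sizes are assumptions, not the thesis code."""

    def __init__(self, channels: int, cond_channels: int, dilation: int):
        super().__init__()
        self.pad = dilation  # left-pad so the kernel-2 convolution stays causal
        self.filter_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        # 1x1 projections that inject the conditioning into both gates
        self.cond_filter = nn.Conv1d(cond_channels, channels, 1)
        self.cond_gate = nn.Conv1d(cond_channels, channels, 1)
        self.residual = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) audio features
        # c: (batch, cond_channels, time) conditioning at the audio rate
        xp = F.pad(x, (self.pad, 0))  # causal: only past samples are visible
        f = torch.tanh(self.filter_conv(xp) + self.cond_filter(c))
        g = torch.sigmoid(self.gate_conv(xp) + self.cond_gate(c))
        return x + self.residual(f * g)  # residual connection

block = GatedResidualBlock(channels=64, cond_channels=80, dilation=2)
x = torch.zeros(1, 64, 16000)  # one second of audio features at 16 kHz
c = torch.zeros(1, 80, 16000)  # e.g. 80 mel bands, upsampled per sample
y = block(x, c)                # same shape as x
```

In this sketch, global conditioning differs only in that the conditioning vector is constant over time (for example, a one-hot instrument label broadcast across the time axis), whereas local conditioning lets the mel or MFCC frames steer each generated sample.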

[This abstract was generated with the help of AI]