Vocoding with Differentiable Digital Signal Processing: Development of a Real-Time Vocal Effect Plugin
Author: Südholt, David
Term: 4. Term
Education:
Publication year: 2022
Submitted on: 2022-05-25
Abstract
This thesis explores how Differentiable Digital Signal Processing (DDSP) can be used to create musical vocal effects and be deployed in a real-time plugin. Two approaches are introduced. In the first, pitch and loudness are extracted from incoming singing or speech, and a pretrained DDSP decoder generates harmonic synthesis controls. To preserve phonetic content, the synthesized harmonic components are blended with the voice’s amplitude spectrum via a user-controlled interpolation, yielding a vocoder-like effect; this method underpins a real-time vocal effects plugin. In the second approach, autoencoder models are trained on datasets combining vocal and instrumental sounds; an MFCC-based latent variable z encodes phonetic information while the training data composition shapes the reconstructed timbre. During training, a transient “sweet spot” is observed in which lyrics remain intelligible while timbre shifts, before the model converges to standard voice transfer. The project includes the design and implementation of the plugin and a perceptual evaluation comparing both approaches; detailed evaluation outcomes are beyond the scope of this excerpt.
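The vocoder-like effect described above rests on blending the DDSP-synthesized harmonic spectrum with the input voice's amplitude spectrum under a user-controlled interpolation. A minimal sketch of that blend, assuming a simple linear interpolation over STFT magnitude frames (the function name, the `alpha` parameter, and the linear mix are illustrative assumptions, not the thesis's exact formulation):

```python
import numpy as np

def blend_spectra(voice_mag: np.ndarray, synth_mag: np.ndarray,
                  alpha: float) -> np.ndarray:
    """Interpolate between two amplitude spectra.

    voice_mag -- magnitude spectrum of the input voice (per STFT frame)
    synth_mag -- magnitude spectrum of the DDSP harmonic synthesis
    alpha     -- user control in [0, 1]: 0 keeps the voice spectrum,
                 1 keeps the synthesized spectrum
    """
    alpha = float(np.clip(alpha, 0.0, 1.0))
    # Linear crossfade of the two magnitude spectra, bin by bin.
    return (1.0 - alpha) * voice_mag + alpha * synth_mag

# Toy example on single 4-bin frames:
voice = np.array([1.0, 0.8, 0.2, 0.0])
synth = np.array([0.0, 0.2, 0.8, 1.0])
mixed = blend_spectra(voice, synth, 0.5)
```

In a real-time plugin this would run per STFT frame, with the blended magnitudes recombined with the voice's phase before the inverse transform; that phase-handling detail is outside this sketch.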
[This abstract was generated with the help of AI directly from the project's full text.]
