A master's thesis from Aalborg University

Speaker De-Identification using a Factorized Hierarchical Variational Autoencoder

Author

Term

4th semester

Publication year

2018

Submitted on

Pages

44

Abstract

This thesis examines Speaker De-Identification (SDI): methods for altering the speaker identity in an audio signal so that the source speaker’s identity is masked. We focus on a machine learning model called the Factorized Hierarchical Variational Autoencoder (FHVAE), which is designed to separate speech into two parts: the linguistic content (what is said) and non-linguistic factors (how it is said, such as speaker characteristics). This separation makes the FHVAE a natural candidate for SDI. We compare the FHVAE to a baseline system that combines a statistical Gaussian Mixture Model (GMM) mapping with a signal-processing approach known as a Harmonic plus Stochastic Model. Performance is evaluated on two criteria: 1) intelligibility, measured by an automatic speech recognition system using Word Error Rate (WER), i.e., the proportion of words recognized incorrectly; and 2) how well the systems mask the source speaker’s identity, measured by a speaker recognition system using Equal Error Rate (EER), a standard error-rate metric in speaker recognition. We also investigate whether a simpler metric could replace WER for assessing intelligibility. The results show that the FHVAE yields better intelligibility than the baseline but performs worse at de-identification, meaning it is less effective at hiding the source speaker’s identity. The attempt to find a simpler intelligibility metric to replace WER was unsuccessful.

[This abstract was generated with the help of AI]
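
The two evaluation metrics named in the abstract, WER and EER, can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names `word_error_rate` and `equal_error_rate`, the whitespace tokenization, and the simple threshold sweep are all assumptions made for illustration. WER is computed here in its usual form, as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. EER is approximated as the error rate at the threshold where the false-acceptance and false-rejection rates are closest.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(
    target_scores: Sequence[float], impostor_scores: Sequence[float]
) -> float:
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    eer = 0.0
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False acceptance: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection: target trials that fail the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6]))
```

The abstract does not describe how recognition trials were constructed or scored, so the score lists above are made-up inputs. Evaluation toolkits typically interpolate between thresholds rather than pick the nearest one, which can give slightly different EER values on small trial sets.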