Speaker De-Identification using a Factorized Hierarchical Variational Autoencoder

Author

Kastbjerg, Mathias Bülow

Term

4th semester

Education

Mathematical Engineering, Master

Publication year

2018

Submitted on

2018-06-07

Pages

Abstract

In recent years, the concept of Speaker De-Identification (SDI) has emerged. SDI is the task of changing the speaker identity of a speech signal from a source speaker to a target speaker, with a specific focus on masking the identity of the source speaker. Hsu, Zhang, and Glass (2017) introduced a Factorized Hierarchical Variational Autoencoder (FHVAE) for speech analysis. The FHVAE aims to factorize the speech signal into a linguistic part and a non-linguistic part, and this factorization motivates its use for SDI. The focus of this project is to investigate the performance of the FHVAE model when used for SDI. The model is compared to a baseline system based on a GMM mapping and a Harmonic plus Stochastic Model. The performance of the models is evaluated on two criteria: 1) intelligibility, measured by an Automatic Speech Recognition system computing the Word Error Rate (WER); 2) how well the systems mask the identity of the source speaker, measured by a speaker recognition system computing the Equal Error Rate (EER). Furthermore, it is investigated whether a simpler metric for measuring intelligibility can be developed. The FHVAE model showed good results on intelligibility compared to the baseline, but was found inferior on the de-identification task. The search for a metric to replace the WER as a measure of intelligibility was unsuccessful.
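The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the thesis's evaluation pipeline, which relies on a full Automatic Speech Recognition system and a speaker recognition system; it only shows how WER and EER are commonly defined. The function names `word_error_rate` and `equal_error_rate`, the whitespace tokenization, and the simple threshold sweep are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(target_scores, impostor_scores) -> float:
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest, reported as their mean.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(target_scores) | set(impostor_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

For intelligibility, a lower WER is better. How the EER of the speaker recognition system should be read for de-identification depends on the trial design used in the thesis, which the abstract does not specify.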

Keywords

Speaker De-Identification ; Machine Learning ; Variational Bayes ; Autoencoder ; Speaker Recognition ; Automatic Speech Recognition ; Voice Transformation

A master's thesis from Aalborg University
