Speaker De-Identification using a Factorized Hierarchical Variational Autoencoder
Author
Term
4th semester
Education
Publication year
2018
Submitted on
2018-06-07
Pages
44
Abstract
In recent years, the concept of Speaker De-Identification (SDI) has emerged. SDI is the task of changing the speaker identity of a speech signal from a source speaker to a target speaker, with a specific focus on masking the identity of the source speaker. In (Hsu, Zhang, and Glass 2017), a Factorized Hierarchical Variational Autoencoder (FHVAE) was introduced for speech analysis. The FHVAE aims to factorize the speech signal into a linguistic part and a non-linguistic part, and this factorization motivates its use for SDI. This project investigates the performance of the FHVAE model when used for SDI. The model is compared to a baseline system based on a GMM mapping and a Harmonic plus Stochastic Model. Performance is evaluated on two criteria: 1) intelligibility, measured by an Automatic Speech Recognition system computing the Word Error Rate (WER); and 2) how well the systems mask the identity of the source speaker, measured by a speaker recognition system computing the Equal Error Rate (EER). Furthermore, it is investigated whether a simpler metric for intelligibility can be developed. The FHVAE model showed good results on intelligibility compared to the baseline, but was found inferior on the de-identification task. The search for a metric to replace the WER as a measure of intelligibility was unsuccessful.
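For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal Python sketch of how WER and EER are commonly computed. It is not the ASR or speaker recognition system used in the project. The function names `word_error_rate` and `equal_error_rate`, the threshold sweep over observed scores, and the convention that a higher score means "same speaker" are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(genuine, impostor):
    """Approximate EER: the error rate at the threshold where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest.
    Assumes higher scores mean 'same speaker' (an illustrative convention)."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. The EER sketch only tests thresholds at observed scores, so on small score sets it is an approximation rather than an interpolated value.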
