Speaker De-Identification using a Factorized Hierarchical Variational Autoencoder

Student project: Master's thesis and HD graduation project

  • Mathias Bülow Kastbjerg
4th semester, Mathematical Engineering (cand.polyt.), Master (Master's programme)
In recent years, the concept of Speaker
De-Identification (SDI) has emerged. SDI
handles the task of changing the speaker
identity of a speech signal from a source
speaker to a target speaker. Specifically,
SDI focuses on masking the identity of the
source speaker. Hsu, Zhang, and Glass
(2017) introduced a Factorized Hierarchical Variational
Autoencoder (FHVAE) for
speech analysis. The FHVAE aims to factorize
the speech signal into a linguistic
part and a non-linguistic part. This factorization
motivates the use of the FHVAE
for SDI. The focus of this project is to investigate
the performance of the FHVAE
model when used for SDI. The model is
compared to a baseline system based on
a Gaussian Mixture Model (GMM) mapping and a Harmonic plus
Stochastic Model. The performance of
the models is evaluated on two criteria:
1) Intelligibility, measured by an Automatic
Speech Recognition system computing
the Word Error Rate (WER). 2) How
well the systems mask the identity of the
source speaker, measured by a speaker recognition
system computing the Equal Error
Rate (EER). Furthermore, it is investigated
whether a simpler metric to measure the
intelligibility can be developed. The FHVAE
model showed good results on intelligibility
compared to the baseline, but
was found inferior on the de-identification
task. The search for a metric to replace the
WER as a measure of intelligibility was unsuccessful.
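
The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names `word_error_rate` and `equal_error_rate`, the whitespace tokenization, and the simple threshold sweep are all assumptions made for illustration. WER is computed here as the word-level edit distance divided by the number of reference words, and EER as the error rate at the threshold where the false acceptance and false rejection rates are closest.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(target_scores, impostor_scores) -> float:
    """Approximate EER by sweeping every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical toy data, not results from the thesis.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

In this toy example one of six reference words is deleted, so the WER is 1/6, and the sweep finds a threshold where one of four target trials and one of four impostor trials are misclassified. A production evaluation would typically interpolate the error curves rather than pick the nearest observed threshold.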
Publication date: 7 Jun. 2018
Number of pages: 44
ID: 280536562