Speech Coding using Deep Neural Networks and the Information Bottleneck Principle
Student thesis: Master's thesis and HD graduation project
- Barbara Martinovic
4th semester, Mathematical Engineering (cand.polyt.), Master's programme
In this project, the possibility of using Deep Neural Networks (DNNs) and the Information Bottleneck (IB) principle to perform speech coding is explored. An end-to-end strategy using DNNs in the form of autoencoders is developed, and the DNNs are trained on both synthetic data and speech files from the TIMIT database. Signals are encoded using a b-bit scalar quantizer employed internally in the DNNs, and the bit rate is easily controllable through, among other things, the parameters of the quantizer. It was found that the developed speech autoencoders trained with the Mean Squared Error (MSE) as the objective function did not outperform the BroadVoice32 (BV32) codec in terms of both bit rate and Perceptual Evaluation of Speech Quality (PESQ) scores simultaneously. The DNNs did, however, outperform the BV32 codec in terms of PESQ scores at bit rates of 5 bits per sample or higher. By exploring the marginal entropies, the DNN speech autoencoders achieved an average PESQ score of 4.46 with a standard deviation of 0.03, at a bit rate less than half that of standard 16-bit Pulse Code Modulation encoding.
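To make the encoding step concrete, the following is a minimal PyTorch sketch of a uniform b-bit scalar quantizer that can sit inside an autoencoder. The class name, the fixed dynamic range, and the straight-through gradient trick are illustrative assumptions; the thesis's exact quantizer design and network architecture are not reproduced here.

```python
import torch
import torch.nn as nn

class ScalarQuantizer(nn.Module):
    """Uniform b-bit scalar quantizer with a straight-through
    gradient estimator, so gradients can flow through it when the
    surrounding autoencoder is trained end-to-end. (Illustrative
    sketch; not the thesis's exact design.)"""

    def __init__(self, bits: int = 5, x_min: float = -1.0, x_max: float = 1.0):
        super().__init__()
        self.levels = 2 ** bits          # b bits -> 2^b reconstruction levels
        self.x_min, self.x_max = x_min, x_max

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp to the quantizer's dynamic range.
        x = x.clamp(self.x_min, self.x_max)
        step = (self.x_max - self.x_min) / (self.levels - 1)
        # Round to the nearest reconstruction level.
        q = torch.round((x - self.x_min) / step) * step + self.x_min
        # Straight-through estimator: the forward pass outputs q,
        # the backward pass treats the quantizer as the identity.
        return x + (q - x).detach()
```

With a layer like this, changing `bits` (together with the dimensionality of the encoder output) changes the bit rate directly, matching the remark above that the rate is controlled by the quantizer's parameters, among others.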
A loss function involving the MSE and the marginal entropies, inspired by the IB principle, was proposed. However, it was not possible to find adequate weights for which this loss function was suitable for training the DNN speech autoencoders.
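In the spirit of the classical IB Lagrangian, which minimizes I(X;Z) - beta * I(Z;Y), such a loss trades a distortion term against a weighted rate term, roughly L = MSE + beta * H. The sketch below illustrates this trade-off with an empirical, histogram-based entropy estimate; the function names, the single weight beta, and the (non-differentiable) estimator are assumptions for illustration, not the thesis's formulation.

```python
import torch
import torch.nn.functional as F

def marginal_entropy_bits(symbols: torch.Tensor, levels: int) -> torch.Tensor:
    """Empirical marginal entropy (in bits) of quantizer symbol
    indices in a batch. Histogram counting is non-differentiable,
    so this serves as a rate diagnostic rather than a drop-in
    training term."""
    hist = torch.bincount(symbols.flatten(), minlength=levels).float()
    p = hist / hist.sum()
    p = p[p > 0]                         # drop empty bins to avoid log2(0)
    return -(p * torch.log2(p)).sum()

def ib_style_loss(x, x_hat, symbols, levels, beta=0.01):
    # Distortion (MSE) plus a weighted rate term, echoing the IB
    # trade-off between reconstruction fidelity and compression.
    # `symbols` holds integer quantizer indices in [0, levels).
    return F.mse_loss(x_hat, x) + beta * marginal_entropy_bits(symbols, levels)
```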
Language | English |
---|---|
Publication date | 7 Jun 2019 |
Number of pages | 104 |
External collaborator | RTX A/S, Peter Mariager, pm@rtx.dk (Other); RTX A/S, Ricco Jensen, rje@rtx.dk (Other) |