A master's thesis from Aalborg University


Speech Coding using Deep Neural Networks and the Information Bottleneck Principle

Author

Term

4th semester

Publication year

2019

Submitted on

Pages

104

Abstract


This project investigates whether deep neural networks (DNNs) and the Information Bottleneck (IB) principle can be used for speech coding—that is, compressing speech while preserving quality. We build end‑to‑end autoencoders and train them on synthetic signals and speech from the TIMIT database. Inside the networks, a b‑bit scalar quantizer encodes the latent representation, which makes the bit rate easy to control via the quantizer parameters. Performance is assessed by bit rate and the Perceptual Evaluation of Speech Quality (PESQ), and compared with the BroadVoice32 (BV32) codec and with standard 16‑bit pulse‑code modulation (PCM). Autoencoders trained with mean squared error (MSE) as the objective did not surpass BV32 when considering bit rate and PESQ together. However, at bit rates of 5 bits per sample or higher, the DNNs achieved higher PESQ scores than BV32. By analyzing and exploiting the marginal entropies—how much information each encoded dimension carries—we obtained an average PESQ of 4.46 (standard deviation 0.03) at a bit rate less than half that of 16‑bit PCM. We also proposed an IB‑inspired loss that combines MSE with marginal‑entropy terms to balance compression and quality, but we were unable to find weightings that made this loss effective for training the speech autoencoders.
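The two building blocks named in the abstract—a b‑bit scalar quantizer on the latent representation, and the marginal entropy of each quantized dimension used to estimate the achievable bit rate—can be sketched as below. This is a minimal illustration under assumed names and a latent range of [-1, 1], not the thesis's actual implementation:

```python
import numpy as np

def quantize(z, b, z_min=-1.0, z_max=1.0):
    """Uniform b-bit scalar quantizer: maps each latent value to one of 2**b levels.

    Returns the integer level indices (the code to transmit) and the
    dequantized reconstruction (the mid-point of each quantization cell).
    """
    levels = 2 ** b
    step = (z_max - z_min) / levels
    idx = np.clip(np.floor((z - z_min) / step), 0, levels - 1).astype(int)
    return idx, z_min + (idx + 0.5) * step

def marginal_entropy(idx, levels):
    """Empirical entropy in bits of one quantized dimension.

    If the level distribution is peaked, the entropy falls below b bits,
    i.e. the dimension carries less information than its nominal rate.
    """
    counts = np.bincount(idx, minlength=levels)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
z = np.tanh(rng.normal(size=10_000))   # stand-in for one latent dimension
idx, z_hat = quantize(z, b=5)
h = marginal_entropy(idx, 2 ** 5)      # at most 5 bits; lower if levels are unevenly used
```

An entropy coder applied to the level indices could then approach `h` rather than the full `b` bits per sample, which is the kind of gap the marginal-entropy analysis in the thesis exploits.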

[This abstract was generated with the help of AI]