Term
4. term
Publication year
2013
Submitted on
2013-06-06
Pages
109 pages
Abstract
Modern society depends on transportation by cargo ships, but cargo ships can have a negative effect on marine wildlife. The North Atlantic right whale is an endangered species that is especially threatened by collisions with cargo ships. Right whales frequently emit a characteristic sound known as an \emph{up-call}, which can be used to detect when a right whale is in a particular area. A system that uses hydrophones to detect whether a right whale is in a particular area has therefore been constructed by Cornell University's Bioacoustic Research Program~\cite{ListenForWhales}. In this connection, a classification system that can recognize when an audio recording of ocean sounds contains an up-call is desired. For this purpose, audio files containing ocean sounds, recorded by this system, have been provided. Each of these has been annotated with a label telling whether it contains an up-call or not~\cite{KaggleChallenge}. In this thesis we build a classification system that decides whether an audio file contains an up-call or not. To build the classification system, the audio files must first be preprocessed into data that describe the source of the audio file content as well as possible. As data we use \acp{MFCC}, which have often been used for speech recognition~\cite{Rabiner89atutorial,oppenheim2009discrete} but also for recognizing whale sounds~\cite{Brown2009,Roch2007}. To obtain the \acp{MFCC}, a digital signal is extracted from each audio file, and several transformations are applied to the signal to get the features. In this process the signal is divided into overlapping frames. The result of the preprocessing is, for each audio file, a feature vector for each frame consisting of the \acp{MFCC} for that frame. The first part of the process also makes it possible to construct a spectrogram of the audio file, which can be used to visualize the frequencies the audio file contains.
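The frame-based MFCC pipeline described above (overlapping windowed frames, power spectrum, mel filterbank, log, DCT) can be sketched in plain numpy. This is a minimal illustrative sketch, not the thesis's actual implementation; the frame length, hop size, and filter counts below are placeholder values, not the parameters used in the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=256, hop=128, n_filters=26, n_ceps=13):
    # 1) split the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2) power spectrum of each frame (these columns also form the spectrogram)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3) triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        for k in range(lo, c):
            fbank[j, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[j, k] = (hi - k) / max(hi - c, 1)
    # 4) log filterbank energies, then DCT-II to decorrelate -> MFCCs
    energies = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return energies @ dct.T  # shape: (n_frames, n_ceps)
```

The intermediate `spec` array is exactly the (power) spectrogram mentioned in the abstract, so the same first steps serve both visualization and feature extraction.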
The classification system contains two models: a positive model that represents the feature vectors of the audio files that contain an up-call, and a negative model that represents the feature vectors of the audio files that do not. The system classifies an audio file by calculating the ratio between the probability that the feature data for the audio file were generated by the positive model and the probability that they were generated by the negative model. The result is compared with a threshold, and if the ratio is higher than the threshold, the audio file is classified as containing an up-call. Three different model types are compared in order to investigate which performs best when used in the classification system. The first model type uses a single \ac{GMM} and does not divide the audio files into frames; there is therefore only one feature vector per audio file. The second model type uses several \acp{GMM}: the audio files are divided into frames, and each frame is considered as being generated by its own \ac{GMM}. The third model type uses an \ac{HMM} where each state of the underlying Markov process has an associated \ac{GMM}. The \ac{EM}-algorithm is used for learning the models. A general description of the \ac{EM}-algorithm is given, and it is described how the two steps of the algorithm can be derived for the \ac{GMM} and the \ac{HMM}, with particular emphasis on the E-step. The models and the \ac{EM}-algorithm for \acp{GMM} and \acp{HMM} have been implemented, and specifics of the implementation are described. We then compare how the three model types perform when used in the classification system. The comparison is made by finding the \ac{AUC} of the \ac{ROC} curves, precision, recall, accuracy, and the $F_1$-measure. For all the measures, the model type using one \ac{GMM}, where the audio files are not divided into frames, scores the highest values.
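The likelihood-ratio decision rule described above can be sketched as follows. This is an illustrative sketch only: it assumes diagonal-covariance Gaussian mixture components and a simple `(weights, means, variances)` parameter layout, neither of which is taken from the thesis; the threshold of 0 is likewise a placeholder.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    # Total log-likelihood of the feature vectors X under a diagonal GMM,
    # summed over vectors (assumed independent given the model).
    X = np.atleast_2d(X)
    comp = []
    for w, mu, var in zip(weights, means, variances):
        log_pdf = -0.5 * (np.sum((X - mu) ** 2 / var, axis=1)
                          + np.sum(np.log(2.0 * np.pi * var)))
        comp.append(np.log(w) + log_pdf)
    # log-sum-exp over mixture components, then sum over feature vectors
    return np.logaddexp.reduce(comp, axis=0).sum()

def classify(X, pos_model, neg_model, threshold=0.0):
    # Log-likelihood ratio: classify as "up-call" when the positive model
    # explains the features better than the negative model by `threshold`.
    llr = gmm_loglik(X, *pos_model) - gmm_loglik(X, *neg_model)
    return llr > threshold
```

Sweeping `threshold` over a range of values is what traces out the ROC curve used in the comparison.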
The model type using several frames and a \ac{GMM} for each frame scores the second highest values, and the model type using an \ac{HMM} scores the lowest values. A confusion matrix is then constructed for each model type at the best threshold on the \ac{ROC} curve, and it turns out that all three model types classify a high number of audio files as containing an up-call even though they do not. We had expected the \ac{HMM} to score the highest values, so we investigated this model further. This was done by looking at the spectrograms of some audio files that contain an up-call. Thereafter, for a positive \ac{HMM}, the most probable path through the state space is found for these audio files. This is used to investigate whether there is a connection between the state the \ac{HMM} assigns to a given frame and the content of that frame. It turns out that the \ac{HMM} can, to some degree, detect the placement of the up-call in an audio file.
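Finding the most probable state path through an HMM is the classical Viterbi algorithm; a minimal numpy sketch in log space is shown below. It is illustrative only: in the thesis the per-state emission probabilities come from the state's associated GMM, whereas here the per-frame emission log-likelihoods are simply taken as input.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_start):
    # log_emis[t, s]: log-likelihood of frame t under state s's emission model
    # log_trans[i, j]: log-probability of transitioning from state i to j
    # log_start[s]:    log-probability of starting in state s
    # Returns the most probable state sequence (one state per frame).
    T, S = log_emis.shape
    delta = log_start + log_emis[0]          # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[from, to]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emis[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):            # trace backpointers from the end
        path[t - 1] = back[t, path[t]]
    return path
```

Plotting such a path against the spectrogram is one way to check whether particular states line up with the frames that cover the up-call.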