• Stine Back Larsen
  • Morten Albeck Nielsen
4th term, Computer Science, Master (Master Programme)
Modern society depends on transportation by cargo ships, but cargo ships can negatively affect marine wildlife. The North Atlantic right whale is an endangered species that is especially threatened by collisions with cargo ships. The right whale frequently emits a characteristic sound known as an \emph{up-call}, which can be used to detect when a right whale is present in a particular area. A system using hydrophones for detecting whether a right whale is in a particular area has therefore been constructed by Cornell University's Bioacoustic Research Program~\cite{ListenForWhales}. In connection with this, a classification system that can recognize when an audio recording of ocean sounds contains an up-call is desired. For this purpose, audio files containing ocean sounds recorded by this system have been provided, each annotated with a label stating whether or not it contains an up-call~\cite{KaggleChallenge}. In this thesis we build a classification system which classifies whether an audio file contains an up-call or not.

In order to build the classification system, the audio files must first be preprocessed into data that describe the source of the audio file's content. As features we use \acp{MFCC}, which have often been used for speech recognition~\cite{Rabiner89atutorial,oppenheim2009discrete} but also for recognizing whale sounds~\cite{Brown2009,Roch2007}. To obtain the \acp{MFCC}, the digital signal is extracted from the audio files, and several transformations are applied to the signal. In this process the signal is divided into overlapping frames. The result of the preprocessing is, for each frame of each audio file, a feature vector consisting of the \acp{MFCC} for that frame. The first part of the process also makes it possible to construct a spectrogram of the audio file, which can be used for visualizing its frequency content.
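As a rough illustration of this pipeline, the sketch below computes \acp{MFCC} with NumPy and SciPy: framing with overlap, windowing, power spectrum (whose rows form a spectrogram), mel filterbank, log, and DCT. The frame length (25\,ms), hop (10\,ms), FFT size, filter count, and number of kept coefficients are common defaults chosen for this example, not the values used in the thesis:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale.
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    hz_points = mel_to_hz(np.linspace(low, high, n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(signal, sample_rate, frame_len=0.025, frame_hop=0.010,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Divide the signal into overlapping frames and apply a Hamming window.
    flen = int(frame_len * sample_rate)
    hop = int(frame_hop * sample_rate)
    n_frames = 1 + (len(signal) - flen) // hop
    frames = np.stack([signal[i * hop:i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # 2. Power spectrum of each frame; the rows form a spectrogram.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Apply the mel filterbank and take logarithms.
    fb = mel_filterbank(n_filters, n_fft, sample_rate)
    log_mel = np.log(power @ fb.T + 1e-10)
    # 4. The DCT decorrelates the log energies; keep the first n_ceps.
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

The output is one feature vector of \texttt{n\_ceps} coefficients per frame, matching the per-frame feature vectors described above.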

The classification system contains two models: a positive model which represents the feature vectors of the audio files that contain an up-call, and a negative model which represents the feature vectors of the audio files that do not. The system classifies an audio file by calculating the ratio between the probability that the feature data for the audio file were generated by the positive model and the probability that they were generated by the negative model. The result is compared with a threshold, and if it is higher than the threshold, the audio file is classified as containing an up-call.
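In log space this probability ratio becomes a difference of log-likelihoods. A minimal sketch of the decision rule, assuming frames are modeled independently (so the file-level log-likelihood is a sum over frames) and using an illustrative threshold of 0:

```python
import numpy as np

def log_likelihood_ratio(frame_ll_pos, frame_ll_neg):
    # Sum per-frame log-likelihoods under each model; the ratio of
    # likelihoods becomes a difference of sums in log space.
    return np.sum(frame_ll_pos) - np.sum(frame_ll_neg)

def classify(frame_ll_pos, frame_ll_neg, threshold=0.0):
    # Classify as "contains an up-call" when the ratio exceeds the threshold.
    return log_likelihood_ratio(frame_ll_pos, frame_ll_neg) > threshold
```

Varying the threshold trades false positives against false negatives, which is what the ROC analysis later in the thesis explores.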

Three different model types are compared in order to investigate which performs best when used in the classification system. The first model type uses a single \ac{GMM} and does not divide the audio files into frames; there is therefore only one feature vector per audio file. The second model type uses several \acp{GMM}: the audio files are divided into frames, and each frame is considered as being generated by its own \ac{GMM}. The third model type uses an \ac{HMM} where each state of the underlying Markov process has an associated \ac{GMM}.
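For the \ac{HMM}-based model type, the likelihood of a frame sequence is obtained with the forward algorithm. The sketch below works in log space for numerical stability; the array shapes and the idea of passing per-frame emission log-likelihoods (e.g.\ from the \ac{GMM} attached to each state) as \texttt{log\_B} are conventions of this illustration, not the thesis implementation:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_likelihood(log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under an HMM.

    log_pi: (S,)    initial state log-probabilities
    log_A:  (S, S)  transition log-probabilities, row = source state
    log_B:  (T, S)  per-frame emission log-likelihoods per state
    """
    # alpha[s] = log P(observations up to t, state at t = s)
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # Marginalize over the previous state, then add the emission term.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)
```

Summing over all state paths in this way replaces the exponential enumeration of paths with an $O(TS^2)$ recursion.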

The \ac{EM} algorithm is used for learning the models. A general description of the \ac{EM} algorithm is given, and we describe how the two steps of the algorithm can be derived for \acp{GMM} and \acp{HMM}, with particular emphasis on the E-step. The models and the \ac{EM} algorithm for \acp{GMM} and \acp{HMM} have been implemented, and specifications of the implementations are described.
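The sketch below illustrates the two steps for a \ac{GMM}, under the simplifying assumption (made only for this example) of diagonal covariances and a deterministic spread-out initialization. The E-step computes the responsibilities $\gamma_{ik} = P(\text{component } k \mid x_i)$ and the M-step re-estimates the weights, means, and variances from them:

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    # Row-wise log-density of a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def em_gmm(X, n_components, n_iter=100):
    """EM for a diagonal-covariance GMM; a sketch, not the thesis code."""
    n, d = X.shape
    w = np.full(n_components, 1.0 / n_components)
    # Initialize means at points spread along the first coordinate.
    idx = np.argsort(X[:, 0])[np.linspace(0, n - 1, n_components).astype(int)]
    mu = X[idx].copy()
    var = np.tile(X.var(axis=0) + 1e-6, (n_components, 1))
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = P(component k | x_i).
        log_p = np.stack([np.log(w[k]) + log_gauss_diag(X, mu[k], var[k])
                          for k in range(n_components)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        Nk = gamma.sum(axis=0)
        w = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

Each iteration does not decrease the data log-likelihood, which is the key property the general derivation in the thesis establishes.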

We then compare how the three model types perform when used in the classification system. The comparison is made using the \ac{AUC} of the \ac{ROC} curves, precision, recall, accuracy, and the $F_1$-measure. On all measures, the model type using one \ac{GMM}, where the audio files are not divided into frames, scores highest. The model type using several frames with a \ac{GMM} for each frame scores second highest, and the model type using an \ac{HMM} scores lowest. A confusion matrix is then constructed for each model type at the best threshold on the \ac{ROC} curve, and it turns out that all three model types classify a high number of audio files as containing an up-call even though they do not. We had expected the \ac{HMM} to score highest, so we investigated this model further by looking at the spectrograms of some of the audio files that contain an up-call. Thereafter, the most probable path through the state space of a positive \ac{HMM} is found for these audio files. This is used to investigate whether there is a connection between the state the \ac{HMM} assigns to a given frame and the content of that frame. It turns out that the \ac{HMM} can, to some degree, detect the placement of the up-call in an audio file.
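The evaluation measures follow directly from classifier scores and confusion-matrix counts. A minimal sketch (not the thesis evaluation code): the \ac{AUC} is obtained by sweeping the threshold over all scores and integrating the resulting \ac{ROC} points with the trapezoidal rule, and the remaining measures come from the four confusion-matrix entries:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC of the ROC curve via a threshold sweep over the scores."""
    order = np.argsort(scores)[::-1]          # descending by score
    labels = np.asarray(labels, dtype=float)[order]
    # Each prefix of the sorted list corresponds to one threshold.
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / (1 - labels).sum()))
    # Trapezoidal rule over the ROC points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```

A high number of false positives, as observed for all three model types, shows up here as a low precision even when recall and accuracy look reasonable.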
Publication date: 6 Jun 2013
Number of pages: 109
ID: 77314976