A master's thesis from Aalborg University


Human Action Recognition using Bag of Features

Author

Term

4th term

Publication year

2012

Submitted on

Pages

51

Abstract


This thesis examines the popular bag-of-features approach to recognizing human actions in video. In this approach, small space–time patterns detected in a video are turned into 'visual words' and counted to represent each clip. We compare several feature detectors (which find informative points in space and time) and descriptors (which encode how those points look and move). For experiments, we use the Harris3D detector together with the HOG/HOF descriptor. We evaluate both supervised and unsupervised classification to show the difference between learning with and without labeled examples. The supervised method is a support vector machine (SVM). The unsupervised methods are k-means clustering and affinity propagation; the latter has not previously been used for action recognition. We test on two datasets: the simpler KTH dataset with 6 classes and the more challenging UCF50 with 50 classes. SVM achieves good results on both datasets. Among the unsupervised methods, affinity propagation performs best, but SVM outperforms both unsupervised approaches. Finally, we study how to build the visual vocabulary, a central step in bag-of-features. The results show that increasing the number of visual 'words' slightly improves classification performance.
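The core of the pipeline described above can be sketched in a few lines: descriptors extracted from video are clustered into a visual vocabulary, and each clip is then represented by a histogram counting how often each visual word occurs. This is a minimal toy sketch in pure Python, not the thesis code: the 2-D "descriptors" here are synthetic stand-ins for the HOG/HOF descriptors the thesis extracts at Harris3D interest points, and the function names are hypothetical.

```python
# Toy sketch of the bag-of-features representation (hypothetical names;
# real descriptors would be HOG/HOF vectors from space-time interest points).
import random

random.seed(0)

def dist2(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means: the k centroids form the 'visual vocabulary'."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        # move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centroids

def bag_of_features(descriptors, vocabulary):
    """Histogram of visual-word counts representing one video clip."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        i = min(range(len(vocabulary)), key=lambda c: dist2(d, vocabulary[c]))
        hist[i] += 1
    return hist

# synthetic descriptors from two clips of different 'actions'
clip_a = [[random.gauss(0, 0.1), random.gauss(0, 0.1)] for _ in range(50)]
clip_b = [[random.gauss(5, 0.1), random.gauss(5, 0.1)] for _ in range(50)]
vocab = kmeans(clip_a + clip_b, k=2)
print(bag_of_features(clip_a, vocab))
```

The resulting histograms are the fixed-length vectors that the classifiers compared in the thesis (SVM, k-means, affinity propagation) operate on; enlarging the vocabulary, as studied in the final experiments, simply means choosing a larger k.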

[This abstract was generated with the help of AI]