A master's thesis from Aalborg University


Automatic sonification of video sequences through object detection and physical modelling

Author

Term

4th term

Publication year

2017

Abstract


Sounds from the objects around us are part of everyday life. We link specific sounds to particular objects and actions, and we expect sight and sound to match—for example, a motor sound usually means a car is approaching. In film, many of these effects are added later in post-production, a practice known as Foley. This project explores generating such sound effects automatically, using a key property of each object—its material—as the main cue. We present a system with three parts: an object detector that identifies objects in video, an impact detector that finds moments when they collide or are struck, and a sound modeler that produces the corresponding audio. We then run a perceptual study in which participants watch video clips and hear the sounds predicted by the model. The study tests whether the generated sounds align with what people expect to hear for the materials and actions shown.
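The thesis abstract does not name the physical-modelling technique used by the sound model, so the sketch below is purely illustrative: a minimal modal-synthesis impact sound in Python, in which the material presets, parameter values, and function names are assumptions for illustration rather than details taken from the thesis.

# Minimal, illustrative modal-synthesis sketch of an impact sound.
# The material presets and helper names are hypothetical; the thesis abstract
# does not specify which physical model or parameters were actually used.
import numpy as np

SAMPLE_RATE = 44100

# Hypothetical material presets: modal frequencies (Hz), decay rates (1/s),
# and relative amplitudes for a struck object of each material.
MATERIALS = {
    "wood":  {"freqs": [220.0, 560.0, 910.0],   "decays": [18.0, 25.0, 35.0], "amps": [1.0, 0.6, 0.3]},
    "metal": {"freqs": [430.0, 1180.0, 2630.0], "decays": [3.0, 4.5, 6.0],    "amps": [1.0, 0.7, 0.5]},
    "glass": {"freqs": [980.0, 2150.0, 3720.0], "decays": [6.0, 8.0, 11.0],   "amps": [1.0, 0.5, 0.4]},
}

def synthesize_impact(material: str, duration: float = 1.0, strength: float = 1.0) -> np.ndarray:
    """Sum exponentially decaying sinusoids (modes) for the given material."""
    preset = MATERIALS[material]
    t = np.arange(int(duration * SAMPLE_RATE)) / SAMPLE_RATE
    signal = np.zeros_like(t)
    for f, d, a in zip(preset["freqs"], preset["decays"], preset["amps"]):
        signal += strength * a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    # Normalize so the result can be written to a 16-bit file without clipping.
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

if __name__ == "__main__":
    # Example use: the impact detector has flagged a collision, and the object
    # detector has classified the struck object as "metal".
    audio = synthesize_impact("metal", duration=1.5)
    print(f"Generated {len(audio)} samples of a 'metal' impact sound.")

In such a pipeline the object detector would supply the material label, the impact detector the moment and strength of the collision, and a routine like the one above would render the corresponding audio; the actual system may differ in all of these details.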

[This abstract was generated with the help of AI]