A master's thesis from Aalborg University


Image and Video Analysis for Intelligent Driver Monitoring in Car Cabins

Translated title

Billede- og Videoanalyse til Intelligent Førerassistance i Bilkabiner

Author

Term

4th semester

Publication year

2024

Submitted on

Pages

89

Abstract


This master's thesis in Computer Engineering (AI, Vision & Sound) was carried out during a semester at UC San Diego. The work is research-oriented and organized in separate parts that reflect a research process rather than a linear development project; its focus is video understanding inside the car cabin. First, it proposes a method for reconstructing missing thermal video frames (images based on heat) from regular color video (RGB) using conditional Generative Adversarial Networks (cGANs), achieving strong results. Second, it experiments with vision-language models (AI that links visual input with text labels), using multiple camera angles to classify driver activities and showing promising generalization across viewpoints. Third, it studies driver drowsiness detection with video transformers, a family of neural networks designed for sequences of frames, and analyzes how much visual detail is needed for accurate classification. The work also includes a custom face-cropped version of the UTA-RLDD dataset. Parts of the work have been accepted at the 35th IEEE Intelligent Vehicles Symposium (IV) and at the CVPR Vision and Language for Autonomous Driving and Robotics Workshop.
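
The thermal-reconstruction part builds on conditional GANs, where the generator is conditioned on the RGB frame and the discriminator judges RGB-thermal pairs. The sketch below illustrates one common formulation of this idea (a pix2pix-style adversarial plus L1 objective) with toy networks and random stand-in tensors; the layer sizes, loss weight, and training details are assumptions for illustration, not the architecture used in the thesis.

```python
# Illustrative pix2pix-style conditional GAN step for RGB -> thermal translation.
# All sizes, weights, and data below are assumed placeholders.
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Toy encoder-decoder mapping a 3-channel RGB frame to a 1-channel thermal frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, rgb):
        return self.net(rgb)

class TinyDiscriminator(nn.Module):
    """Conditional critic that scores concatenated RGB + thermal pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=1, padding=1),
        )

    def forward(self, rgb, thermal):
        return self.net(torch.cat([rgb, thermal], dim=1))

G, D = TinyGenerator(), TinyDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

rgb = torch.rand(2, 3, 64, 64)      # stand-in RGB cabin frames
thermal = torch.rand(2, 1, 64, 64)  # stand-in paired thermal frames

# Discriminator step: real pairs labelled 1, generated pairs labelled 0.
fake = G(rgb).detach()
pred_real, pred_fake = D(rgb, thermal), D(rgb, fake)
d_loss = bce(pred_real, torch.ones_like(pred_real)) + bce(pred_fake, torch.zeros_like(pred_fake))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator while staying close to the real thermal frame (L1 term).
fake = G(rgb)
pred = D(rgb, fake)
g_loss = bce(pred, torch.ones_like(pred)) + 100.0 * l1(fake, thermal)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```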
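
For the vision-language experiments on driver activities, a standard zero-shot setup compares a cabin frame against one text prompt per activity class and picks the most similar one. The snippet below shows that pattern with an openly available CLIP checkpoint via Hugging Face Transformers; the checkpoint name, prompts, and activity labels are illustrative assumptions, not necessarily those used in the thesis.

```python
# Zero-shot driver-activity scoring with a generic CLIP-style model.
# Checkpoint and labels are assumed placeholders, not the thesis setup.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical activity prompts; a real study would use the dataset's class list.
labels = [
    "a driver looking at the road",
    "a driver talking on a phone",
    "a driver drinking from a bottle",
    "a driver reaching behind the seat",
]

frame = Image.new("RGB", (224, 224))  # stand-in for one cabin camera frame

inputs = processor(text=labels, images=frame, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # image-text similarity scores, shape (1, len(labels))
probs = logits.softmax(dim=-1)
print({label: round(float(p), 3) for label, p in zip(labels, probs[0])})
```

Because only the text prompts define the classes, the same setup can be evaluated on frames from different camera angles without retraining, which is the kind of cross-viewpoint generalization the abstract refers to.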
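
The custom face-cropped version of UTA-RLDD implies a preprocessing pass that localizes the driver's face in each frame and keeps only that region. A minimal sketch of such a pass is shown below, assuming OpenCV's bundled Haar-cascade face detector and hypothetical file names; the detector, crop size, and handling of missed detections in the thesis are not specified here.

```python
# Build a face-cropped copy of a driver video (illustrative only).
# Detector choice, crop size, and file names are assumptions.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def crop_faces(video_in: str, video_out: str, size: int = 224) -> None:
    """Crop the largest detected face in every frame and write a new video."""
    cap = cv2.VideoCapture(video_in)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    writer = cv2.VideoWriter(video_out, cv2.VideoWriter_fourcc(*"mp4v"), fps, (size, size))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue  # skipped here; a real pipeline might reuse the previous box
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
        writer.write(cv2.resize(frame[y:y + h, x:x + w], (size, size)))
    cap.release()
    writer.release()

# Hypothetical input/output names for one recording.
crop_faces("driver_video.mp4", "driver_video_face.mp4")
```

Cropping to the face region is one way to probe how much visual detail the drowsiness classifier actually needs, which is the question the abstract raises for the video-transformer experiments.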

[This summary has been rewritten with the help of AI based on the project's original abstract]