AAU Student Projects
An executive master's programme thesis from Aalborg University


Enhancing Visual-Language Models in Zero-Shot Pedestrian-to-Driver Navigation Gesture Recognition in Conflicting Gesture-Authority Scenarios for Autonomous Driving Decision-Making

Term

4th semester

Publication year

2025

Pages

73

Abstract

This study addresses the task of recognizing pedestrian-to-driver navigation gestures in a zero-shot setting, enabling safe decision-making even in conflicting scenarios. Navigation gestures are part of everyday driving and help keep traffic safe for everyone. Conflicting gestures are more of an edge case, but such situations can also be critical, making gesture recognition and decision-making essential. Recognizing pedestrians' gestures is a central aspect of the study. This led to the development of two enhancement methods, Supplementary Body Description with VLM and Pose Projection, and three evaluation methods, Classification, Natural-language, and Reconstruction, for VLMs in this domain. Alongside these, three annotated datasets were created: Acted Traffic Gesture (ATG), Instructive Traffic Gesture In-The-Wild (ITGI), and Acted Conflicting Authorities & Navigation Gestures (Act-CANG). Across three VLMs, initial results were poor in all three evaluation domains: VideoLLaMA3, with and without enhancements, achieved F1-scores between 0.02 and 0.06 in classification. These results highlight the current limitations of VLMs in accurately recognizing pedestrian navigation gestures and underscore the need for further research, whether through fine-tuning or alternative approaches.
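The classification evaluation above is reported as F1-scores over gesture classes. As a minimal sketch of how such a score can be computed, the following computes a macro-averaged F1 in pure Python; the gesture labels and predictions are purely illustrative and are not taken from the thesis datasets:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over the label set present in the gold annotations."""
    f1s = []
    for label in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical per-clip gesture labels (not from ATG/ITGI/Act-CANG).
gold = ["stop", "go", "stop", "turn", "go"]
pred = ["go", "go", "stop", "go", "stop"]
print(round(macro_f1(gold, pred), 2))  # → 0.3
```

Macro-averaging weights every gesture class equally, so rarely predicted classes (a common failure mode in zero-shot settings) pull the score down sharply, which is consistent with the very low F1-scores reported in the abstract.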