Enhancing Visual-Language Models in Zero-Shot Pedestrian-to-Driver Navigation Gesture Recognition in Conflicting Gesture-Authority Scenarios for Autonomous Driving Decision-Making
Author: Bossen, Tonko Emil Westerhof
Term: 4th semester
Education:
Publication year: 2025
Submitted on: 2025-06-04
Pages: 73
Abstract
Every day, pedestrians use hand and body gestures to guide drivers—waving a car through, signaling it to stop, or clarifying who should go first. This thesis investigates whether today’s AI can recognize such pedestrian-to-driver navigation gestures without training on that specific task (a zero-shot setting), so systems can still make safe decisions in rare or conflicting situations. We explore vision-language models (VLMs)—AI models that link visual input with text—and propose two enhancement ideas that add body cues and pose information: Supplementary Body Description with VLM and Pose Projection. We also design three ways to evaluate performance: classification (assigning a gesture label), natural-language evaluation (describing the gesture in words), and reconstruction. To support this, we compile three annotated datasets: Acted Traffic Gesture (ATG), Instructive Traffic Gesture In-The-Wild (ITGI), and Acted Conflicting Authorities & Navigation Gestures (Act-CANG). Across three VLMs, initial results are poor in all evaluation domains. For example, VideoLLaMA3 achieves F1-scores of 0.02–0.06 in classification, both with and without the proposed enhancements. The F1-score combines precision and recall; values this low indicate the models rarely recognize gestures correctly. These findings highlight current limitations of VLMs for pedestrian navigation gesture recognition and point to the need for further research, such as fine-tuning or alternative approaches.
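For readers unfamiliar with the metric: by its standard definition (not specific to this thesis), the F1-score is the harmonic mean of precision and recall,

\[
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\]

so an F1 of 0.04 corresponds, in the balanced case where precision equals recall, to only about 4% of predicted labels being correct and about 4% of true gestures being recognized.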
[This abstract has been rewritten with the help of AI based on the project's original abstract]
