Enhancing Visual-Language Models in Zero-Shot Pedestrian-to-Driver Navigation Gesture Recognition in Conflicting Gesture-Authority Scenarios for Autonomous Driving Decision-Making
Author
Term
4th semester
Education
Publication year
2025
Submitted on
2025-06-04
Pages
73
Abstract
This study addresses the task of recognizing pedestrian-to-driver navigation gestures in a zero-shot setting, enabling safe decision-making even in conflicting scenarios. Navigation gestures are part of everyday traffic and help keep driving safe for everyone. Conflicting gestures are rarer edge cases, but such situations can still be safety-critical, which makes reliable gesture recognition and decision-making essential. Recognizing pedestrians' gestures is therefore a central aspect of the study. This led to the development of two enhancement methods, Supplementary Body Description with VLM and Pose Projection, and three evaluation methods, Classification, Natural-language, and Reconstruction, for VLMs in this domain. In addition, three annotated datasets were created: Acted Traffic Gesture (ATG), Instructive Traffic Gesture In-The-Wild (ITGI), and Acted Conflicting Authorities & Navigation Gestures (Act-CANG). Across three VLMs, initial results were poor in all three evaluation domains: VideoLLaMA3, with and without enhancements, achieved F1-scores between only 0.02 and 0.06 in classification. These results highlight the current limitations of VLMs in accurately recognizing pedestrian navigation gestures and underscore the need for further research, whether through fine-tuning or alternative approaches.
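To make the classification evaluation concrete, below is a minimal Python sketch of a zero-shot setup scored with F1, assuming a fixed label set and macro averaging. The ask_vlm callable, the prompt wording, and the LABELS set are hypothetical placeholders standing in for the thesis's actual pipeline and gesture classes, which the abstract does not specify.

    from typing import Callable

    from sklearn.metrics import f1_score

    # Hypothetical label set; the abstract does not list the actual gesture classes.
    LABELS = ["stop", "go", "slow_down", "turn_left", "turn_right"]


    def build_prompt(labels: list[str]) -> str:
        """Zero-shot classification prompt: the model must answer with one label."""
        return (
            "A pedestrian is directing traffic in this video. "
            "Which navigation gesture do they perform? "
            f"Answer with exactly one of: {', '.join(labels)}."
        )


    def evaluate(
        samples: list[tuple[str, str]],      # (video_path, true_label) pairs
        ask_vlm: Callable[[str, str], str],  # (video_path, prompt) -> predicted label
    ) -> float:
        """Score predictions against ground truth; macro F1 is an assumption here."""
        prompt = build_prompt(LABELS)
        y_true = [label for _, label in samples]
        y_pred = [ask_vlm(path, prompt) for path, _ in samples]
        return f1_score(y_true, y_pred, labels=LABELS, average="macro")


    if __name__ == "__main__":
        # Trivial stand-in for a real VLM inference call (e.g. VideoLLaMA3).
        dummy = lambda path, prompt: "stop"
        print(evaluate([("clip1.mp4", "stop"), ("clip2.mp4", "go")], dummy))

In practice the ask_vlm stand-in would be replaced by an actual video-language model call, and the enhancement methods described above (Supplementary Body Description, Pose Projection) would modify the input before that call.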