Enhancing Visual-Language Models in Zero-Shot Pedestrian-to-Driver Navigation Gesture Recognition in Conflicting Gesture-Authority Scenarios for Autonomous Driving Decision-Making
Author: Bossen, Tonko Emil Westerhof
Term: 4th semester
Education:
Publication year: 2025
Submitted on: 2025-06-04
Pages: 73
Abstract
Every day, pedestrians use hand and body gestures to guide drivers—waving a car through, signaling it to stop, or clarifying who should go first. This thesis investigates whether today’s AI can recognize such pedestrian-to-driver navigation gestures without training on that specific task (a zero-shot setting), so systems can still make safe decisions in rare or conflicting situations. We explore vision-language models (VLMs)—AI models that link visual input with text—and propose two enhancement ideas that add body cues and pose information: Supplementary Body Description with VLM and Pose Projection. We also design three ways to evaluate performance: classification (assigning a gesture label), natural-language evaluation (describing the gesture in words), and reconstruction. To support this, we compile three annotated datasets: Acted Traffic Gesture (ATG), Instructive Traffic Gesture In-The-Wild (ITGI), and Acted Conflicting Authorities & Navigation Gestures (Act-CANG). Across three VLMs, initial results are poor in all evaluation domains. For example, VideoLLaMA3 achieves F1-scores of 0.02–0.06 in classification, both with and without the proposed enhancements. The F1-score combines precision and recall; values this low indicate the models rarely recognize gestures correctly. These findings highlight current limitations of VLMs for pedestrian navigation gesture recognition and point to the need for further research, such as fine-tuning or alternative approaches.
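For readers unfamiliar with the metric: by its standard definition (not specific to this thesis), the F1-score is the harmonic mean of precision and recall,

\[
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\]

so an F1 of 0.04 corresponds, in the balanced case where precision equals recall, to only about 4% of predicted labels being correct and about 4% of true gestures being recognized.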
[This abstract has been rewritten with the help of AI based on the project's original abstract]
