Temporal Evidence Refinement for VLM-Based Weakly Supervised Video Anomaly Detection
Author
Han, Zexu
Term
4th semester
Education
Publication year
2026
Submitted on
2026-06-04
Abstract
This thesis examines how temporal evidence can be refined post hoc in weakly supervised video anomaly detection (WSVAD) when semantic cues are produced by a vision‑language model (VLM). It builds on VERA, which uses learned natural‑language questions to obtain segment‑level anomaly scores from a frozen VLM and then applies a coarse‑to‑fine refinement pipeline to produce frame‑level localization. The work formulates VERA’s score‑refinement stage as a controlled interface and tests three alternative score‑level transformations: local score aggregation, global similarity‑based score aggregation, and a local‑global residual fusion. Methodologically, it adopts the local‑global temporal distinction from VadCLIP as an analytical principle but applies it only after VERA’s segment‑level scoring, leaving the VLM backbone and prompt‑based semantic scoring unchanged. The variants are evaluated on UCF‑Crime and XD‑Violence with a focus on frame‑level anomaly localization using ROC‑AUC and average precision. Results indicate a metric trade‑off: VERA’s original refinement maintains the strongest ROC‑AUC ranking in the reported experiments, while the local‑global residual fusion yields the highest average precision. These findings suggest that score‑level refinement can change how anomaly evidence is concentrated over time but do not support replacing VERA’s original coarse‑to‑fine pipeline. The thesis contributes a clear formulation of the refinement interface, well‑defined local and global score variants, and a diagnostic analysis of their behavior across datasets.
This thesis investigates how temporal evidence can be refined post hoc in weakly supervised video anomaly detection (WSVAD) when the semantic signals come from a vision‑language model (VLM). The starting point is VERA, which uses learned natural‑language questions to derive segment‑level anomaly scores from a frozen VLM and then applies a coarse‑to‑fine refinement chain to obtain frame‑level localization. The work formulates VERA’s refinement stage as a controlled interface and tests three alternative score‑level transformations: local score aggregation, global similarity‑based score aggregation, and a local‑global residual fusion. Methodologically, the study builds on the local‑global temporal distinction known from VadCLIP, but applies it exclusively after the fact to VERA’s segment scores, without changing the VLM backbone or the prompt‑based semantic scoring. The variants are evaluated on UCF‑Crime and XD‑Violence, focusing on frame‑level anomaly localization using ROC‑AUC and average precision. The results show a metric trade‑off: VERA’s original refinement retains the strongest ROC‑AUC ranking in the reported experiments, while the local‑global residual fusion achieves the highest average precision. The findings indicate that score‑level refinement can change how anomaly evidence is concentrated over time, but they provide no basis for replacing VERA’s original coarse‑to‑fine pipeline. The thesis contributes a clear formulation of the refinement interface, well‑defined local and global score variants, and a diagnostic analysis of their behavior across datasets.
[This abstract has been generated with the help of AI directly from the project full text]
Keywords
