AAU Student Projects is unavailable between June 15th 1.30pm and 17th 1.30pm due to planned system maintenance. The projects cannot be downloaded during this period.
AAU Student Projects - visit Aalborg University's student projects portal
A master's thesis from Aalborg University
Book cover


CellString: Optimising Query Performance for AIS Trajectories in Columnar Storage

Authors

; ;

Term

4. term

Education

Publication year

2026

Submitted on

Abstract

The Automatic Identification System (AIS) produces very large datasets that describe vessel movements across space and time. Traditional ways of querying these data, which use geometric lines (LineStrings) and spatial index structures (R-trees), perform poorly because the rectangles used to bound the geometries (Minimum Bounding Rectangles, MBRs) overlap too much. In this thesis, we introduce a columnar version of CellString, a cell-based data type that represents vessel trajectories and stops as contiguous sequences of cells. We use the Web Mercator XYZ tiling scheme, where geographic coordinates are converted into one-dimensional, bit-encoded quadkeys. This encoding preserves spatial locality directly in the keys, allowing the DuckDB database system to discard irrelevant data during table scans (zonemap pruning) without relying on additional, specialized spatial indexes. We further extend the CellString data type by adding entry and exit timestamps to each cell and by reorganising the physical storage into an unnested columnar layout. This replaces expensive geometric computations with fast, parallelisable integer join operations. We evaluate our approach on a real-world Danish AIS dataset with more than 1.8 billion AIS points, comparing CellString with LineString for spatial, temporal, and combined spatio-temporal range queries. The results show that CellString performance is strongly driven by query selectivity: we obtain significant speedups for queries that focus on restricted geographic areas or narrow time windows, especially in low-traffic regions. For less selective queries, the amount of intermediate data grows, reducing the advantage over LineString. Experiments also show that CellString queries can achieve super-linear speedups when more CPU cores are used (super-linear parallelisation). Through case studies, we demonstrate that CellString enables fast analysis of transit routes, supports fine-grained spatial coverage queries within the Danish Exclusive Economic Zone (EEZ), and offers finer spatio-temporal selectivity than LineString. Finally, we show that CellString tables can be tuned for either spatial or temporal workloads by changing the sort order of table rows.

Det Automatic Identification System (AIS) genererer meget store datasæt, som beskriver skibes bevægelser over tid og sted. Traditionelle metoder til at søge i disse data, der bruger geometriske linjer (LineStrings) og rumlige indeksstrukturer (R-træer), fungerer dårligt, fordi de rektangler, man bruger til at afgrænse områder (Minimum Bounding Rectangles, MBRs), overlapper hinanden for meget. I dette projekt præsenterer vi en kolonnebaseret version af CellString, en cellebaseret datatype, der repræsenterer skibsruter og stop som sammenhængende sekvenser af celler. Vi bruger Web Mercator XYZ-fliseskemaet, hvor geografiske koordinater omdannes til én-dimensionelle, bit-kodede såkaldte quadkeys. Denne kodning bevarer geografisk nærhed i selve nøglerne, så databasen DuckDB kan sortere irrelevante data fra (zonemap pruning) under tabelscan uden brug af ekstra, specialiserede rumlige indeks. Vi udvider desuden CellString-datatypen ved at tilføje ind- og udrejsetidspunkter for hver celle og ved at omlægge den fysiske datalagring til et ikke-indlejret (unnested) kolonneformat. På den måde kan vi erstatte tunge geometriske beregninger med hurtige, parallelle join-operationer på heltal. Metoden afprøves på et dansk AIS-datasæt fra den virkelige verden med mere end 1,8 milliarder AIS-positioner. Her sammenligner vi CellString med LineString for rumlige, tidslige og kombinerede rum‑tidslige forespørgsler. Resultaterne viser, at ydelsen for CellString i høj grad afhænger af selektiviteten i forespørgslerne: vi opnår markante hastighedsforbedringer for forespørgsler, der fokuserer på afgrænsede geografiske områder eller snævre tidsvinduer – især i områder med lav skibstrafik. For mindre selektive forespørgsler opstår der større mængder mellemliggende data, hvilket mindsker forskellen til LineString. Eksperimenterne viser også, at CellString-forespørgsler kan skalere bedre end lineært, når der tilføjes flere processorkerner (super-lineær parallelisering). Gennem casestudier viser vi, at CellString giver hurtige analyser af skibsruter, gør det muligt at lave detaljerede analyser af den rumlige dækning inden for den danske eksklusive økonomiske zone (EEZ), og giver mere præcis filtrering i både rum og tid end LineString. Endelig viser vi, at CellString-tabeller kan optimeres til enten rumlige eller tidslige arbejdsbelastninger ved at ændre rækkefølgen, som rækkerne sorteres i.

[This abstract has been rewritten with the help of AI based on the project's original abstract]