A master's thesis from Aalborg University


DIPAAL: DIstributed PostgreSQL-based AIS Analytics and Loading

Authors

; ;

Term

4. term

Education

Publication year

2023

Pages

25

Abstract

AIS data (Automatic Identification System signals from ship transponders) have strong potential for analysis, but they are not designed for it and must be cleaned, processed, and stored before use. This thesis presents an extension of DIPAAL: a system with an efficient, modular ETL process (extract, transform, load) for ingesting AIS data and a distributed data warehouse that stores ship trajectories. We design, develop, and evaluate a spatially distributed data warehouse with two representations: grid cells (dividing the area into small cells) and heatmaps (density maps), enabling faster and more robust analytics. At the time of writing, DIPAAL stores 414 million kilometers of ship trajectories and more than 10 billion rows in the largest table. We find that the granular cell representation eliminates out-of-memory errors from prior work and makes queries up to 3.24x faster than trajectory-based queries. We also find that spatially divided shards (partitions) provide consistently good scale-up for both cell and heatmap analytics over large areas: 3.54x to 11.64x speedups when increasing workers by 5x. Finally, the spatial divisions become slightly imbalanced over time as traffic patterns evolve.
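The two representations named in the abstract can be illustrated with a minimal sketch. It is not the DIPAAL implementation: the cell size, the function names, and the use of plain projected coordinates in metres are all assumptions made for illustration. The last helper only restates the reported scale-up figures as speedup per added worker factor.

```python
from collections import Counter
from typing import Iterable, Tuple

CELL_SIZE_M = 1000.0  # hypothetical cell size; the abstract does not state one


def cell_of(x: float, y: float, cell_size: float = CELL_SIZE_M) -> Tuple[int, int]:
    """Map a projected coordinate (metres) to the integer index of its grid cell."""
    return (int(x // cell_size), int(y // cell_size))


def heatmap(points: Iterable[Tuple[float, float]],
            cell_size: float = CELL_SIZE_M) -> Counter:
    """Count how many points fall in each cell, i.e. a simple density map."""
    return Counter(cell_of(x, y, cell_size) for x, y in points)


def scaleup_efficiency(speedup: float, worker_factor: float) -> float:
    """Speedup divided by the factor of added workers; 1.0 means linear scaling."""
    return speedup / worker_factor


if __name__ == "__main__":
    # Made-up sample positions, not real AIS data.
    sample = [(100.0, 200.0), (900.0, 999.0), (1500.0, 200.0), (-1.0, 0.0)]
    print(heatmap(sample))
    # Reported range: 3.54x to 11.64x speedup for 5x the workers.
    print(scaleup_efficiency(3.54, 5.0), scaleup_efficiency(11.64, 5.0))
```

Under this reading, the reported range corresponds to roughly 0.71 to 2.33 of linear scaling, so the upper end would be better than linear.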


[This abstract has been rewritten with the help of AI based on the project's original abstract]