A master's thesis from Aalborg University


Integrating Multi-Modal Spatial Data using Knowledge Graphs - a Case Study of Microflora Danica

Term

4th semester

Publication year

2025

Abstract

This thesis investigates how to integrate semantically related, heterogeneous, multi-modal datasets when one modality is spatial. As a real-world case, we use Microflora Danica (MfD), a collection of GPS-annotated microbial samples, combined with environmental raster data from EcoDes-DK15 and Danish soil maps. The central challenge is to fuse vector and raster data across differing formats, resolutions, and coordinate systems while minimizing information loss. We outline a knowledge graph design that employs S2 Geometry as a common hierarchical spatial reference: MfD point locations and environmental raster cells are mapped to S2 cells, enabling uniform, semantically defined linking in RDF. To prevent data conflicts, we apply a majority rule so that each S2 cell is associated with a single raster value. Evaluating multiple S2 levels, we find that level 24 yields an acceptable information loss of about 2.2%, and that S2-based integration loses less information than up/downscaling baselines. We discuss the trade-off between spatial granularity and storage, along with potential optimizations: aggregating S2 cells (reducing the number of cells by up to 81%), partitioning the graph for more efficient queries, and possible machine learning applications. Overall, our results indicate that S2-based knowledge graphs can support flexible and accurate integration of multi-modal spatial data with limited information loss.

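The majority rule mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the cell identifiers are hypothetical strings standing in for S2 cell IDs, the function names are invented for illustration, the tie-breaking rule is an assumption, and `discarded_fraction` is only an assumed proxy for information loss (the share of raster observations whose value differs from the one kept for their cell), not necessarily the metric behind the reported 2.2%.

```python
from collections import Counter, defaultdict


def majority_value_per_cell(observations):
    """Pick one raster value per cell by majority vote.

    observations: list of (cell_id, raster_value) pairs, where cell_id is a
    hypothetical stand-in for an S2 cell identifier.
    Ties are broken deterministically by the string form of the value
    (an assumption made for this sketch).
    """
    counts = defaultdict(Counter)
    for cell_id, value in observations:
        counts[cell_id][value] += 1

    assigned = {}
    for cell_id, counter in counts.items():
        # Highest count wins; among equal counts, the smallest string form wins.
        value, _ = min(counter.items(), key=lambda kv: (-kv[1], str(kv[0])))
        assigned[cell_id] = value
    return assigned


def discarded_fraction(observations, assigned):
    """Share of observations whose value was not kept for their cell.

    An assumed, simplified proxy for information loss.
    """
    if not observations:
        return 0.0
    lost = sum(1 for cell_id, value in observations if assigned[cell_id] != value)
    return lost / len(observations)


# Toy data: three raster cells overlap "cell_a", one overlaps "cell_b".
example = [
    ("cell_a", "clay"),
    ("cell_a", "clay"),
    ("cell_a", "sand"),
    ("cell_b", "sand"),
]
example_assigned = majority_value_per_cell(example)
example_loss = discarded_fraction(example, example_assigned)
```

In this toy input, "cell_a" keeps "clay" and one of the four observations is discarded. Under this proxy, finer cells would tend to overlap fewer distinct raster values and so discard less, at the cost of more cells to store, which is the granularity versus storage trade-off the abstract describes.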

[This abstract has been generated with the help of AI directly from the project full text]