AAU Student Projects - visit Aalborg University's student projects portal
A master's thesis from Aalborg University
Book cover


Jena Bloom: Query Execution with Bloom Filter Existence Check

Author

Term

4. term

Education

Publication year

2023

Submitted on

Pages

10

Abstract

Graph data management and triplestores have gained increasing attention. A triplestore stores information as triples (subject, predicate, object), commonly used in knowledge graphs. Efficient querying remains challenging, in part because many optimizations from traditional databases are not yet fully applied. This thesis investigates how to reduce query time by tackling costly existence checks inside join operations, which combine results from multiple parts of a query. Our approach replaces disk-based index lookups used only to test whether matches exist with an in-memory Bloom filter. A Bloom filter is a compact, probabilistic data structure that quickly answers whether a value is likely present (or definitely absent), avoiding expensive disk accesses. We also use statistics about triple data to choose which specific joins should use Bloom-filter checks to benefit execution time. We extend the reference triplestore Apache Jena with this optimization and integrate it into query planning; we refer to the approach as JenaBloom. Evaluated on more than 1,500 queries, the approach shows effectiveness for queries that return both empty and non-empty result sets.

Viden om grafer og triplestores har fået stigende opmærksomhed. En triplestore er en database, der lagrer information som tripler (emne, relation, objekt), ofte brugt i vidensgrafer. Det er stadig svært at køre forespørgsler effektivt i sådanne systemer, blandt andet fordi mange klassiske optimeringer fra relationelle databaser ikke er fuldt udnyttet. Denne afhandling undersøger, hvordan man kan reducere køretid ved at adressere dyre eksistenstjek i join‑operationer, som kombinerer resultater fra flere dele af en forespørgsel. Vores løsning erstatter diskbaserede indeksopslag, der kun bruges til at afgøre om der findes match, med et Bloom‑filter i hukommelsen. Et Bloom‑filter er en kompakt, probabilistisk datastruktur, der hurtigt kan afgøre, om noget sandsynligvis findes (eller helt sikkert ikke findes), hvilket kan spare dyre diskopslag. Vi bruger desuden statistik om tripledata til at vælge, hvilke konkrete join‑operationer der bør anvende Bloom‑filter‑tjek for at gavne køretiden. Vi udvider referencetriplestoren Apache Jena med denne optimering og integrerer den i forespørgselsplanlægningen; vi kalder tilgangen JenaBloom. Evalueringen på mere end 1.500 forespørgsler viser, at tilgangen er effektiv for både forespørgsler med tomme og med ikke‑tomme resultatsæt.

[This apstract has been rewritten with the help of AI based on the project's original abstract]