An executive master's programme thesis from Aalborg University


TV 2 Danmark - Comparative Analysis of Data Organization Techniques in Databricks

Author

Term

4th term

Publication year

2024

Submitted on

Pages

63

Abstract

This project aims to speed up reads of one of TV 2 Danmark's larger Delta datasets, which contains TV 2 Play tracking data. The current dataset has critical performance problems that lengthen response times in backend transformation workflows and slow queries for end users. We evaluated four data organization techniques in Spark: Hive partitioning, Z-Ordering, Liquid Clustering, and Bloom filters. These techniques change how data is grouped and indexed so that queries can skip unnecessary data and read less. We created copies of the dataset, each organized with one of the techniques, and tested them from three performance perspectives: (1) how effectively irrelevant files can be skipped (pruning), (2) how quickly records for specific individuals can be found (point lookups), and (3) how quickly larger subsets can be aggregated. The tests show that the newer Liquid Clustering technique delivers the best overall performance and generally outperforms the other methods, with only a few exceptions.
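To illustrate why grouping data lets a query skip files, the following is a minimal sketch in plain Python, not the project's actual code or Delta Lake's implementation. It assumes a toy model in which each "file" stores only the minimum and maximum user id it contains. The names `write_files`, `files_to_scan`, and `ROWS_PER_FILE`, and the synthetic user ids, are hypothetical and chosen for illustration only.

```python
import random

ROWS_PER_FILE = 1_000  # assumed file size for this toy model


def write_files(user_ids):
    """Split rows into fixed-size 'files' and record min/max statistics for each."""
    files = []
    for start in range(0, len(user_ids), ROWS_PER_FILE):
        rows = user_ids[start:start + ROWS_PER_FILE]
        files.append({"min": min(rows), "max": max(rows), "rows": rows})
    return files


def files_to_scan(files, user_id):
    """Prune every file whose [min, max] range cannot contain user_id."""
    return [f for f in files if f["min"] <= user_id <= f["max"]]


# Synthetic data: 10,000 hypothetical user ids with 10 events each, shuffled
# to mimic rows arriving in no particular order.
rng = random.Random(0)
user_ids = [uid for uid in range(10_000) for _ in range(10)]
rng.shuffle(user_ids)

unordered = write_files(user_ids)          # rows stored as they arrived
clustered = write_files(sorted(user_ids))  # rows grouped by user id before writing

target = 5_000  # point lookup for one hypothetical user
scanned_unordered = len(files_to_scan(unordered, target))
scanned_clustered = len(files_to_scan(clustered, target))

print(f"unordered layout: scan {scanned_unordered} of {len(unordered)} files")
print(f"clustered layout: scan {scanned_clustered} of {len(clustered)} files")
```

In the clustered layout the lookup touches a single file, because only one file's range covers the target id. In the shuffled layout almost every file's range spans the target, so very little can be pruned. This one-dimensional sort is only a stand-in: the techniques compared in the project organize data on one or more columns inside a real storage engine, which this sketch does not attempt to model.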

[This summary has been rewritten with the help of AI based on the project's original abstract]