TV 2 Danmark - Comparative Analysis of Data Organization Techniques in Databricks

Author

Kristensen, Andreas Esmann

Term

4th term

Education

Master of Information Technology, Software Development (Continuing education)

Publication year

2024

Submitted on

2024-09-26

Abstract

The goal of this project is to optimize one of TV 2 Danmark's larger Delta datasets, which contains TV 2 Play tracking data, for faster reads. The current dataset has multiple critical performance flaws that need to be addressed: they slow the response times of back-end transformation flows and degrade the experience of end users querying the dataset. Four data organization techniques available in Spark are considered: Hive Partitioning, Z-Ordering, Liquid Clustering and Bloom Filters. To determine which technique performs best, the original dataset was duplicated into multiple variations of each of the four techniques, and each variation was then tested from three selected performance perspectives: how well the dataset can be pruned of irrelevant files, how fast individuals can be identified, and how fast larger subsets of data can be aggregated. The test results indicate that the newly developed Liquid Clustering technique performs best and appears superior to the other techniques, with only a few exceptions.

Keywords

Databricks ; Data Optimization Techniques ; Spark ; Hive Partitioning ; Z-Ordering ; Bloom Filter ; Liquid Clustering

An executive master's programme thesis from Aalborg University