TV 2 Danmark - Comparative Analysis of Data Organization Techniques in Databricks
Author
Term
4th term
Publication year
2024
Submitted on
2024-09-26
Pages
63
Abstract
The goal of this project is to optimize one of TV 2 Danmark's larger Delta datasets, which contains TV 2 Play tracking data, for faster reads. The current dataset has multiple critical performance flaws that need to be addressed: they slow down the response times of back-end transformation flows and degrade the end users' experience when querying the dataset. Four data organization techniques are available in Spark: Hive Partitioning, Z-Ordering, Liquid Clustering and Bloom Filters. To test which technique performs best, the original dataset was duplicated into multiple variations for each of the four techniques, and each variation was then tested from three selected performance perspectives: how well a dataset can be pruned of irrelevant files, how fast individuals can be identified, and how fast larger subsets of data can be aggregated. The test results indicate that the newly developed Liquid Clustering technique performs best and appears superior to the other techniques, with only a few exceptions.