Author(s)
Term
4th term
Publication year
2024
Submitted on
2024-09-26
Pages
63 pages
Abstract
The goal of this project is to optimize one of TV 2 Danmark's larger Delta datasets, which contains TV 2 Play tracking data, for faster reads. The current dataset has multiple critical performance flaws that need to be addressed, because they slow down the response times of back-end transformation flows and degrade the end users' experience when querying the dataset. Four data organization techniques available in Spark are considered: Hive Partitioning, Z-Ordering, Liquid Clustering and Bloom Filters. To determine which technique performs best, the original dataset was duplicated into multiple variants of each of the four techniques, and these variants were then tested from three selected performance perspectives: how well a dataset can be pruned of irrelevant files, how fast individuals can be identified, and how fast larger subsets of data can be aggregated. The test results indicate that the newly developed Liquid Clustering technique performs best and appears superior to the other techniques, with only a few exceptions.
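The first performance perspective, pruning irrelevant files, can be illustrated with a minimal sketch. It is plain Python, not Spark or Delta Lake code, and it is not taken from the project itself. It assumes that each data file records the minimum and maximum value of a column and that a reader skips any file whose range cannot contain the requested value. The names `DataFile`, `files_to_read` and the `user_ids` column are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    """A toy stand-in for one data file and its column values (hypothetical)."""
    name: str
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def files_to_read(files: List[DataFile], target: int) -> List[DataFile]:
    """Keep only files whose [min, max] range could contain the target value."""
    return [f for f in files if f.min_value <= target <= f.max_value]


# Assumed toy data: 1000 user ids spread over 10 files.
user_ids = list(range(1000))
NUM_FILES = 10
rows_per_file = len(user_ids) // NUM_FILES

# Unclustered layout: ids are dealt round-robin, so every file spans almost
# the whole id range and min/max statistics cannot exclude any file.
unclustered = [
    DataFile(f"part-{i}", user_ids[i::NUM_FILES]) for i in range(NUM_FILES)
]

# Clustered layout: each file holds one contiguous id range, so min/max
# statistics exclude every file except the one covering the target.
clustered = [
    DataFile(f"part-{i}", user_ids[i * rows_per_file:(i + 1) * rows_per_file])
    for i in range(NUM_FILES)
]

unclustered_hits = files_to_read(unclustered, 457)
clustered_hits = files_to_read(clustered, 457)

print("files read, unclustered:", len(unclustered_hits))
print("files read, clustered:  ", len(clustered_hits))
```

In this toy setup the clustered layout needs a single file for the lookup while the unclustered layout needs all ten. This is the general effect that layout techniques such as Z-Ordering and Liquid Clustering aim for, although how they actually arrange data is more involved than the contiguous split assumed here.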