Author(s)
Term
4. term
Publication year
2022
Submitted on
2022-12-02
Pages
52 pages
Abstract
This thesis is based on real-world challenges for a data-intensive application in a data lake context, e.g. difficulties in understanding the domain and the meaning of the data. It explores how to optimize data discovery by preserving domain knowledge in a time-saving way. To transform domain knowledge into metadata, the thesis contributes a definition of a novel metadata concept. The metadata concept makes the task of implementing a data catalog and building a metadata foundation manageable and accessible. It guides the user through implementing metadata in the data catalog and classifying the data. The concept is tool, platform, and domain independent. Sample data and the corresponding metadata are introduced. A data catalog is implemented: a collection of metadata, combined with data management and search tools, that helps users find the data they need. The catalog serves as an inventory of available data and provides information about data quality. A small sandbox environment is set up with some sample data. A custom analysis model is defined to learn how to handle the challenges. To ensure the analyses are conducted in a reproducible manner, the scenarios are expressed in a Quality Attribute Scenario framework. The scenario template specifies measurable response measures. The focus is on usability and performance qualities. The usability scenarios cover automatic, manual, and semi-automatic inference of metadata into the data catalog. A performance scenario explores how implicit inference of metadata from a structured directory affects the fetching of data from the data lake. The reader is guided through each analysis with a process map based on the metadata concept and a corresponding logbook. The user actions are marked and explained. The development logbook documents the steps in the analyses. Issues, solutions, and learnings are elaborated.
A result section at the end of each logbook clarifies the partial results. All results are combined and grouped by usability and performance scenarios. The results are discussed in light of the metadata concept and the problem statement. Learnings from the analyses are highlighted. Based on the usability and performance scenarios, it can be concluded that it is possible to preserve domain knowledge in a time-saving way and thus optimize data discovery. The best usability results are obtained with semi-automatic inference, and storing data in a structured directory provides the best performance. However, automatic inference can be a good choice in some cases, as it makes it possible to create searchable technical metadata very quickly. This is a much better starting point than having no information in the data catalog.
Colophon: This page is part of the AAU Student Projects portal, which is run by Aalborg University. Here, you can find and download publicly available bachelor's theses and master's projects from across the university dating from 2008 onwards. Student projects from before 2008 are available in printed form at Aalborg University Library.