Preserving domain knowledge in a data lake

Author

Thomsen, Mads Staberg

Term

4th term

Education

Master of Information Technology, Software Development (Continuing education)

Publication year

2022

Submitted on

2022-12-02

Abstract

This thesis is based on real-world challenges for a data-intensive application in a data lake context, e.g. difficulties with understanding the domain and the meaning of the data. It explores how to optimize data discovery by preserving domain knowledge in a time-saving way. In order to transform domain knowledge into metadata, the thesis contributes a definition of a novel metadata concept. The metadata concept makes the task of implementing a data catalog and building a metadata foundation manageable and accessible. The concept guides the user through implementing metadata in the data catalog and classifying the data. The concept is tool-, platform- and domain-independent. Sample data and the corresponding metadata are introduced. A data catalog is implemented: a collection of metadata, combined with data management and search tools, that helps users find the data they need. The catalog serves as an inventory of available data and provides information about data quality. A small sandbox environment is set up with some sample data. A custom analysis model is defined to learn how to handle the challenges. To ensure the analyses are conducted in a reproducible manner, the scenarios are expressed in a Quality Attribute Scenario framework. The scenario template outlines measurable response measures. Usability and performance qualities are in focus. The usability scenarios cover automatic, manual and semi-automatic inference of metadata into the data catalog. A performance scenario explores how implicit inference of metadata using a structured directory affects the fetching of data from the data lake. The reader is guided through each analysis with a process map based on the metadata concept and a corresponding logbook. The user actions are marked and explained. The development logbook documents the steps in the analyses. Issues, solutions and learnings are elaborated.
A result section at the end of each logbook clarifies the partial results. All results are combined and grouped by usability and performance scenarios. The results are discussed based on the metadata concept and the problem statement. Learnings from the analyses are highlighted. Based on the usability and performance scenarios, it can be concluded that it is possible to preserve domain knowledge in a time-saving way and thus optimize data discovery. The best usability results are obtained with semi-automatic inference, and storing data in a structured directory provides the best performance. However, it should be noted that automatic inference can be a good choice in some cases, as it makes it possible to create searchable technical metadata very quickly. This is a much better starting point than having no information in the data catalog.

Keywords

Domain knowledge ; Data lake ; Data discovery ; Data lineage ; Metadata concept ; Data asset


An executive master's programme thesis from Aalborg University
