Preserving domain knowledge in a data lake

Student thesis: Master programme thesis

  • Mads Staberg Thomsen
This thesis is based on real-world challenges for a data-intensive application in a
data lake context, such as difficulties in understanding the domain and the
meaning of the data. It explores how to optimize data discovery by preserving
domain knowledge in a time-saving way.
To transform domain knowledge into metadata, the thesis contributes a definition
of a novel metadata concept. The metadata concept makes the task of
implementing a data catalog and building a metadata foundation manageable and
accessible. The concept guides the user through the implementation of metadata
in the data catalog and the classification of the data. The concept is tool-,
platform- and domain-independent.
Sample data and the corresponding metadata are introduced. A data catalog is
implemented: a collection of metadata, combined with data management and
search tools, that helps users find the data they need. The catalog serves as an
inventory of available data and provides information about data quality. A small
sandbox environment is set up with some sample data.
A custom analysis model is defined to learn how to handle the challenges. To
ensure that the analyses are conducted in a reproducible manner, the scenarios
are expressed in a Quality Attribute Scenario framework. The scenario template
defines a measurable response measure for each scenario.
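As an illustration only, a quality attribute scenario can be written down as a small record with a measurable threshold. The sketch below uses the six parts commonly associated with quality attribute scenarios (source, stimulus, artifact, environment, response, response measure). The class name, the field values and the 60-second threshold are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityAttributeScenario:
    """One scenario in the common six-part form, plus a numeric threshold."""

    source: str
    stimulus: str
    artifact: str
    environment: str
    response: str
    response_measure: str
    threshold_seconds: float  # hypothetical: the measure is assumed to be a duration

    def is_satisfied(self, measured_seconds: float) -> bool:
        """Return True when the measured duration meets the response measure."""
        return measured_seconds <= self.threshold_seconds


# Hypothetical usability scenario; all values are illustrative assumptions.
scenario = QualityAttributeScenario(
    source="Data analyst",
    stimulus="Searches the data catalog for a dataset by a domain term",
    artifact="Data catalog",
    environment="Sandbox environment with sample data",
    response="The matching dataset and its metadata are found",
    response_measure="Dataset located within 60 seconds",
    threshold_seconds=60.0,
)

print(scenario.is_satisfied(45.0))
```

Stating the response measure as a number is what makes a repeated run of the same scenario comparable with an earlier one.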
The focus is on usability and performance qualities. The usability scenarios cover
automatic, manual and semi-automatic inference of metadata into the data
catalog. A performance scenario explores how implicit inference of metadata from
a structured directory affects the fetching of data from the data lake.
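One way to picture metadata that is implicit in a structured directory is a layout where each folder name carries a `key=value` pair. The sketch below assumes such a layout. The abstract does not state which directory convention the thesis uses, so the paths, the keys and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def infer_metadata_from_path(path: str) -> dict[str, str]:
    """Read key=value folder names from a path and return them as metadata."""
    parts = PurePosixPath(path).parts
    if not parts:
        return {}
    metadata: dict[str, str] = {}
    for part in parts[:-1]:  # every component except the file name
        key, sep, value = part.partition("=")
        if sep and key and value:  # skip folders that are not key=value
            metadata[key] = value
    metadata["file_name"] = parts[-1]
    return metadata


def select_paths(paths: list[str], **filters: str) -> list[str]:
    """Keep only paths whose inferred metadata matches every filter."""
    selected = []
    for path in paths:
        metadata = infer_metadata_from_path(path)
        if all(metadata.get(key) == value for key, value in filters.items()):
            selected.append(path)
    return selected


# Hypothetical data lake paths, assumed for illustration.
paths = [
    "lake/domain=sales/year=2021/orders.csv",
    "lake/domain=sales/year=2022/orders.csv",
    "lake/domain=hr/year=2022/staff.csv",
]

print(infer_metadata_from_path(paths[1]))
print(select_paths(paths, domain="sales", year="2022"))
```

Because the filter works on path names alone, files outside the requested folders never need to be opened, which is one plausible reason a structured directory could speed up fetching.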
The reader is guided through each analysis with a process map based on the
metadata concept and a corresponding logbook. The user actions are marked
and explained. The development logbook documents the steps in the analyses.
Issues, solutions and learnings are elaborated. A result section at the end of each
logbook clarifies the partial results.
All results are combined and grouped by usability and performance scenarios.
The results are discussed based on the metadata concept and the problem
statement. Learnings from the analyses are highlighted. Based on the usability
and performance scenarios, it can be concluded that it is possible to preserve
domain knowledge in a time-saving way and thus optimize data discovery. The
best usability results are obtained with semi-automatic inference, and storing
data in a structured directory provides the best performance.
However, it should be noted that automatic inference can be a good choice in
some cases, as it makes it possible to create searchable technical metadata very
quickly. This is a much better starting point than having no information in the
data catalog.
Publication date: 2 Dec 2022
Number of pages: 52


ID: 503662483