Declarative Data Warehouse setup in PygramETL
Student thesis: Master's thesis and HD graduation project
- Simon Mathiasen
4th semester, Computer Science, Master (Master's programme)
Before Extract-Transform-Load (ETL) programming can begin, a data
warehouse must be created in a database management system, and the
schema of the data warehouse must be programmed in an ETL framework
so that data can be loaded properly from sources. However, setting up
a data warehouse and defining its schema in an appropriate framework
can be labor intensive. Furthermore, the complexity of this task
increases as schemas grow, because the developer must ensure that the
data warehouse schema matches the schema defined in the ETL framework.
In this paper I present DeclarativeETL, a framework that extends
PygramETL [3] and generates the implementation of both the data
warehouse schema and the corresponding PygramETL code. DeclarativeETL
produces a DDL file and a Python file from a shared declarative
specification. By exploiting TOML [6], a simple configuration
language, and a simple syntax for the declarative specification,
DeclarativeETL increases developer productivity: developers are only
required to name the dimension and fact tables and their respective
attributes and measures, together with a set of default values, e.g.
the schema type and the attribute and measure types. The defaults save
the developer many keystrokes, as most attributes share the same type.
DeclarativeETL is evaluated to be fast and lightweight while
increasing productivity by more than 100% in terms of lines of code
compared to programming DDL and PygramETL manually.
Language | English |
---|---|
Publication date | 16 Jun 2023 |
Number of pages | 13 |