SolveDF: Extending Spark DataFrames with support for constrained optimization

Studenteropgave: Kandidatspeciale og HD afgangsprojekt

  • Frederik Madsen Halberg
4. semester, Datalogi, Kandidat (Kandidatuddannelse)
Prescriptive Analytics (PA) is an emerging phase of Business Analytics (BA), which has
traditionally consisted of Descriptive Analytics (DA) and Predictive Analytics (PR).
Whereas DA and PR are concerned with understanding the past and the future, PA is
concerned with providing direct support for decision making, by suggesting (prescribing)
optimal decisions to make for a given business problem. However, existing PA solutions
often consist of several specialized tools glued together in an improvised manner, which is
cumbersome and ineffective. There is a need for more integrated solutions that support
all the necessary steps for PA, including data management, prediction, and optimization
problem solving.
This project details the design and implementation of SolveDF, a tool that extends
Spark SQL with functionality that allows for declarative specification of constrained optimization
problems through solve queries. SolveDF is heavily inspired by SolveDB, and
allows for data management and constrained optimization to be performed seamlessly in
a Big Data environment. SolveDF can leverage the distributed nature of Spark by splitting
optimization problems into smaller independent subproblems that can be solved in
parallel on a cluster. Like Spark SQL, SolveDF is not limited to a single type of data
source, but can be used with many different types of data sources, including JSON-files,
HDFS and any DBMS that supports JDBC.
The report also includes a brief overview of Spark and constrained optimization problem
solving, as well as related work in the area of data management systems with integrated
support for constrained optimization problem solving.
As a part of designing SolveDF, a small usability experiment of SolveDB is performed
to evaluate how intuitive SolveDB is. The results suggest that SolveDB can be learned
quickly with minimal guidance, and that the overall concept and structure of solve queries
make sense. The experiment also identified a number of small problems encountered
when using SolveDB, and some of these problems are addressed in SolveDF.
Performance experiments of SolveDF show that for certain types of problems, SolveDF
has similar performance to SolveDB when running on a single machine. The results also
show that when running SolveDF on a cluster, there appears to be a linear speedup
relative to the amount of nodes in the cluster for certain problems, as shown by SolveDF
being up to 6.85 times faster on a cluster of 8 nodes. In particular, optimization problems
that are partitionable and have high complexity (e.g. Mixed integer programming
problems) are ideal problems for SolveDF to solve. The results also show that SolveDF
could still use more work, as SolveDF is relatively slow at constructing optimization
problems compared to SolveDB.
Udgivelsesdato11 jun. 2017
Antal sider93
ID: 259535551