SolveDF: Extending Spark DataFrames with support for constrained optimization

Author

Halberg, Frederik Madsen

Term

4. term

Education

Computer Science, Master

Publication year

2017

Submitted on

2017-06-11

Pages

Abstract

Prescriptive Analytics (PA) is an emerging phase of Business Analytics (BA), which has traditionally consisted of Descriptive Analytics (DA) and Predictive Analytics (PR). Whereas DA and PR are concerned with understanding the past and the future, PA is concerned with providing direct support for decision making, by suggesting (prescribing) optimal decisions to make for a given business problem. However, existing PA solutions often consist of several specialized tools glued together in an improvised manner, which is cumbersome and ineffective. There is a need for more integrated solutions that support all the necessary steps for PA, including data management, prediction, and optimization problem solving. This project details the design and implementation of SolveDF, a tool that extends Spark SQL with functionality that allows for declarative specification of constrained optimization problems through solve queries. SolveDF is heavily inspired by SolveDB, and allows for data management and constrained optimization to be performed seamlessly in a Big Data environment. SolveDF can leverage the distributed nature of Spark by splitting optimization problems into smaller independent subproblems that can be solved in parallel on a cluster. Like Spark SQL, SolveDF is not limited to a single type of data source, but can be used with many different types of data sources, including JSON-files, HDFS and any DBMS that supports JDBC. The report also includes a brief overview of Spark and constrained optimization problem solving, as well as related work in the area of data management systems with integrated support for constrained optimization problem solving. As a part of designing SolveDF, a small usability experiment of SolveDB is performed to evaluate how intuitive SolveDB is. The results suggest that SolveDB can be learned quickly with minimal guidance, and that the overall concept and structure of solve queries make sense. The experiment also identified a number of small problems encountered when using SolveDB, and some of these problems are addressed in SolveDF. Performance experiments of SolveDF show that for certain types of problems, SolveDF has similar performance to SolveDB when running on a single machine. The results also show that when running SolveDF on a cluster, there appears to be a linear speedup relative to the amount of nodes in the cluster for certain problems, as shown by SolveDF being up to 6.85 times faster on a cluster of 8 nodes. In particular, optimization problems that are partitionable and have high complexity (e.g. Mixed integer programming problems) are ideal problems for SolveDF to solve. The results also show that SolveDF could still use more work, as SolveDF is relatively slow at constructing optimization problems compared to SolveDB.

Keywords

Prescriptive analytics ; Apache Spark ; Big Data ; Constrained optimization ; SQL ; SolveDB

Documents

Download
View record in AAU Student Projects

A master's thesis from Aalborg University

SolveDF: Extending Spark DataFrames with support for constrained optimization