SolveDF: Extending Spark DataFrames with support for constrained optimization
Author
Halberg, Frederik Madsen
Term
4th term
Publication year
2017
Submitted on
2017-06-11
Pages
93
Abstract
Business analytics has traditionally focused on descriptive analytics (what happened) and predictive analytics (what might happen). Prescriptive analytics goes further: it recommends which actions to take by solving optimization problems. Many prescriptive solutions are currently stitched together from multiple specialized tools, which is cumbersome and inefficient. There is a need for more integrated solutions that cover the full process, including data management, prediction, and optimization. This project presents SolveDF, a tool that extends Spark SQL so users can write declarative solve queries to specify constrained optimization problems: you state objectives and constraints, and the system finds the best feasible solution. Inspired by SolveDB, SolveDF brings data management and optimization into the same big data environment. It exploits Spark’s distributed computing by partitioning suitable problems into independent subproblems that can be solved in parallel across a cluster. Like Spark SQL, it works with diverse data sources such as JSON files, HDFS, and databases via JDBC. The report also provides brief background on Spark, constrained optimization, and related systems that integrate data and optimization. To inform the design, we ran a small usability study of SolveDB. Participants learned the basics quickly with minimal guidance, and the overall concept and structure of solve queries were found intuitive. The study also surfaced several minor issues, some of which are addressed in SolveDF. Performance experiments show that, for certain problem classes, SolveDF on a single machine performs similarly to SolveDB. On a cluster it can achieve near-linear speedups for partitionable, high-complexity problems (e.g., mixed-integer programming), reaching up to 6.85x on eight nodes. However, SolveDF currently builds optimization models more slowly than SolveDB, indicating room for further improvement.
[This abstract was generated with the help of AI]
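The partitioning idea mentioned in the abstract (splitting a suitable problem into independent subproblems that can be solved separately and then combined) can be illustrated with a minimal Python sketch. This is not SolveDF's API or implementation: the data, the names `solve_knapsack` and `solve_all`, and the brute-force solver are all hypothetical, and a thread pool stands in for a cluster only to show the structure of the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def solve_knapsack(items, capacity):
    """Brute-force 0/1 knapsack; returns (best_value, chosen_item_names)."""
    best_value, best_choice = 0, ()
    for mask in product((0, 1), repeat=len(items)):
        weight = sum(w for (_, w, _), m in zip(items, mask) if m)
        value = sum(v for (_, _, v), m in zip(items, mask) if m)
        if weight <= capacity and value > best_value:
            best_value = value
            best_choice = tuple(n for (n, _, _), m in zip(items, mask) if m)
    return best_value, best_choice


# Hypothetical data: each group has its own items (name, weight, value) and
# its own capacity. No constraint links the groups, so each group is an
# independent subproblem.
subproblems = {
    "warehouse_a": ([("a1", 3, 4), ("a2", 4, 5), ("a3", 2, 3)], 5),
    "warehouse_b": ([("b1", 1, 1), ("b2", 5, 8)], 5),
}


def solve_all(problems):
    """Solve every independent subproblem separately and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            key: pool.submit(solve_knapsack, items, capacity)
            for key, (items, capacity) in problems.items()
        }
        return {key: future.result() for key, future in futures.items()}


solutions = solve_all(subproblems)
# Because the subproblems share no constraints, the overall optimum is the
# sum of the per-group optima.
total_value = sum(value for value, _ in solutions.values())
```

The decomposition is only valid because no constraint spans more than one group; a problem with a shared budget across groups could not be split this way.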
