SFRDF+: Join Plans for SPARQL Processing in Apache Flink

Student thesis: Master thesis (including HD thesis)

  • Jesper Clausen
  • Mathias Eriksen Otkjær
4. term, Software, Master (Master Programme)
RDF is becoming increasingly popular as a model for representing unstructured data on the Web. As a result, data sets reach web-scale sizes in the form of RDF graphs with billions of triples. Handling such large data sets requires distributed processing systems. SFRDF is one such system, built on the distributed framework Apache Flink and the partitioning technique known as ExtVP. When it was created in the fall of 2016, SFRDF proved nearly competitive with state-of-the-art systems. We propose SFRDF+, an improved version of SFRDF that implements several improvements to the original system. These range from simple changes, such as introducing dictionary encoding, to more advanced features, such as join order optimization to generate better query plans. To find the best approach for generating query plans, we implement and evaluate approaches from the area of RDF processing, namely CliqueSquare, and from traditional database management systems, namely DPCCP and a greedy approach. We evaluate the dictionary encoding and the different approaches to join order optimization and find that the only feasible approach for our system is the greedy one. We also modify the cost function of the greedy approach to prefer bushy plans, to see whether these yield better performance. This is not the case for the plans generated by our modified algorithm, but the standard greedy algorithm shows promising results, with an overall reduction in query response times.
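To make the two techniques named in the abstract concrete, the following is a minimal Python sketch, not the thesis implementation (which runs on Apache Flink). It assumes a toy in-memory setting: `dictionary_encode` replaces RDF terms with integer IDs, and `greedy_join_order` repeatedly joins the two connected subplans with the smallest estimated result size. The function names, the input format, and the fixed `selectivity` cost estimate are all hypothetical choices made for illustration.

```python
from itertools import combinations


def dictionary_encode(triples):
    """Map every distinct RDF term to a compact integer ID (illustrative only)."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID the first time a term is seen.
        encoded.append(
            tuple(term_to_id.setdefault(term, len(term_to_id)) for term in triple)
        )
    return term_to_id, encoded


def greedy_join_order(patterns, selectivity=0.1):
    """Build a join tree by always joining the cheapest connected pair.

    patterns: dict mapping a pattern name to (set of variables, estimated
    cardinality). The cost estimate card(left) * card(right) * selectivity
    is an assumption for this sketch, not the cost function of SFRDF+.
    """
    plans = [
        (name, frozenset(variables), float(card))
        for name, (variables, card) in patterns.items()
    ]
    while len(plans) > 1:
        best = None
        for i, j in combinations(range(len(plans)), 2):
            left, right = plans[i], plans[j]
            if not (left[1] & right[1]):
                continue  # no shared variable: skip cross products
            estimate = left[2] * right[2] * selectivity
            if best is None or estimate < best[0]:
                best = (estimate, i, j)
        if best is None:
            raise ValueError("query graph is disconnected")
        estimate, i, j = best
        joined = (
            (plans[i][0], plans[j][0]),   # nested tuple records the join tree
            plans[i][1] | plans[j][1],    # variables bound by the joined plan
            estimate,
        )
        plans = [p for k, p in enumerate(plans) if k not in (i, j)] + [joined]
    return plans[0][0]


ids, encoded = dictionary_encode([("a", "knows", "b"), ("b", "knows", "c")])
plan = greedy_join_order({
    "tp1": ({"?x", "?y"}, 1000),
    "tp2": ({"?y", "?z"}, 10),
    "tp3": ({"?z", "?w"}, 100),
})
print(encoded)  # [(0, 1, 2), (2, 1, 3)]
print(plan)     # ('tp1', ('tp2', 'tp3'))
```

In this toy example the two smallest connected patterns, `tp2` and `tp3`, are joined first, and the large pattern `tp1` is joined last.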
Publication date: 5 Jun 2017
Number of pages: 64
ID: 259181060