Datasets for modern applications are commonly distributed and are increasingly too large to fit on a single server. Current distributed solutions are designed for central storage or, at best, static data distribution, which can result in poor query performance. Modern end-user applications, however, require results within milliseconds. Thus, there is a growing need for intelligent and efficient data distribution and for federated query engines that can deal with these large amounts of data.

In this project, we aim to develop generic approaches for the automatic redistribution and federated querying of large distributed datasets, in order to facilitate the development of high-performance distributed data storage solutions. The final output will be a set of tools that conform to W3C standards and implement automatic data distribution, federated query planning and execution, dynamic data exchange mechanisms, data storage profiling (providing useful information and statistics about the underlying data) and data monitoring.

Project Goals:

The main goal of the project is to develop novel generic solutions for high-performance RDF data management. The main expected results are:

1) Automatic data distributor and dynamic data exchanger: Algorithms for the automatic redistribution of large linked datasets among storage solutions will facilitate the development of high-performance federation engines. We will dynamically exchange data between storage solutions and exploit data locality to balance the amount of computation placed on any single storage solution. Data security will be based on predefined policies.

2) Data storage monitoring and automatic profile creator: The decision to exchange data dynamically will be based on monitoring, especially of the queries issued to the distributed storage solution. Moreover, we will automatically create data storage profiles (e.g., VoID and DCAT statistics and other relevant metadata such as selectivity information). Our SPARQL federation engine will use this information during query planning. In addition, the profiles can be used in other settings such as Ontology-Based Data Access and query relaxation.

3) A complete SPARQL federation engine: This solution will exploit components 1) and 2) to efficiently process federated SPARQL queries over sets of large linked data stores. Source selection and the generation of optimized query execution plans will be based on component 2). Component 1) will help migrate join computation to the servers, thereby reducing the number of intermediate results and the network traffic and improving query runtime. We expect to significantly reduce the query runtime, the number of intermediate results, the number of server requests, and the network traffic generated during federated query processing over large linked datasets. In addition, we expect to improve the availability of the storage solutions with respect to their query load. The solution will be validated and showcased in real use cases and will be integrated into existing, widely used query and analytics applications.
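
To illustrate how the profiles from component 2) could feed source selection and query planning in component 3), the following is a minimal sketch and not the project's implementation. It assumes a profile is simply a map from predicate to triple count per endpoint, loosely in the spirit of VoID property partitions. The names `EndpointProfile`, `select_sources` and `plan`, as well as the example endpoint URLs, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EndpointProfile:
    """Hypothetical per-endpoint profile: predicate -> number of triples."""
    url: str
    predicate_counts: dict[str, int] = field(default_factory=dict)


def select_sources(predicate: str, profiles: list[EndpointProfile]) -> list[EndpointProfile]:
    """Keep only the endpoints whose profile reports triples for the predicate."""
    return [p for p in profiles if p.predicate_counts.get(predicate, 0) > 0]


def plan(predicates: list[str], profiles: list[EndpointProfile]) -> list[tuple[str, list[str], int]]:
    """Return (predicate, relevant endpoint URLs, estimated cardinality) per
    triple pattern, ordered so that the most selective pattern comes first."""
    steps = []
    for pred in predicates:
        sources = select_sources(pred, profiles)
        # Crude estimate (an assumption): sum of the per-endpoint triple counts.
        estimate = sum(s.predicate_counts[pred] for s in sources)
        steps.append((pred, [s.url for s in sources], estimate))
    return sorted(steps, key=lambda step: step[2])


# Example profiles for two made-up endpoints.
profiles = [
    EndpointProfile("http://a.example/sparql", {"foaf:name": 1000, "foaf:knows": 50}),
    EndpointProfile("http://b.example/sparql", {"foaf:name": 200}),
]

for predicate, endpoints, estimate in plan(["foaf:name", "foaf:knows"], profiles):
    print(predicate, endpoints, estimate)
```

In this sketch, a pattern is sent only to endpoints that can contribute to it, which reduces the number of server requests, and evaluating the most selective pattern first tends to keep intermediate results small. A pattern that no endpoint can answer gets an estimate of 0 and sorts first, which signals early that the query has no results. A real engine would rely on richer statistics than plain triple counts.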

This project is funded by the German Federal Ministry of Education and Research (BMBF) under grant number 01QE2114A.
This project is funded by the Eurostars Programme under project number E114681.