Tags: sparql, semantic-web, jena, fuseki

Querying large RDF Datasets out of memory


I want to download two or more datasets to my machine and be able to start a SPARQL endpoint for each. I tried Fuseki, which is part of the Jena project. However, it loads the whole dataset into memory, which is undesirable if I intend to query large datasets like DBpedia, given that I also plan to do other things (start multiple SPARQL endpoints and run a federated query system over them).

Just to give you a heads up, I intend to link multiple datasets using Silk and query them with the FedX federated query system. If you can recommend a change to the systems I'm using, or give me a tip, that would be great. It would also be a great help if you could suggest a dataset that fits this project.
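To make it concrete, the kind of federated query I have in mind looks something like this (a minimal sketch only; the port numbers, dataset paths, and the second endpoint's vocabulary are hypothetical placeholders, not actual datasets I have set up):

```sparql
# Hypothetical sketch: join data across two local SPARQL endpoints.
# Endpoint URLs and the geonames predicates are placeholders.
SELECT ?city ?population ?other WHERE {
  SERVICE <http://localhost:3030/dbpedia/sparql> {
    ?city a <http://dbpedia.org/ontology/City> ;
          <http://dbpedia.org/ontology/populationTotal> ?population .
  }
  SERVICE <http://localhost:3031/geonames/sparql> {
    ?city <http://www.w3.org/2002/07/owl#sameAs> ?other .
  }
}
```

A system like FedX would figure out the source selection and join ordering itself, so the explicit `SERVICE` clauses would not be strictly necessary there, but this is the shape of the workload.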


Solution

  • Jena's Fuseki can use TDB as a storage mechanism, and TDB stores things on disk. The TDB documentation on caching on 32- and 64-bit Java systems discusses how the file contents are mapped into memory. I do not believe that TDB/Fuseki loads the entire dataset into memory; that simply is not feasible for large datasets, yet TDB can handle rather large datasets. What you should consider doing is using tdbloader to create a TDB store, and then pointing Fuseki at it.

    There's an example of setting up a TDB store in this answer. In there, the query is performed with tdbquery, but according to the Running a Fuseki server section of the documentation, all you will need to do to start Fuseki with the same TDB store is use the --loc=DIR option:

    • --loc=DIR
      Use an existing TDB database. Create an empty one if it does not exist.
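Putting the pieces together, the workflow might look like the following (a sketch only; the file names and store directories are placeholders, and the exact script names can vary with how your Jena/Fuseki distribution is installed):

```shell
# Load an N-Triples dump into an on-disk TDB store (hypothetical paths).
# tdbloader builds TDB's indexes on disk, so the full dataset
# never needs to fit in memory.
tdbloader --loc=/data/tdb/dbpedia dbpedia-dump.nt

# Point Fuseki at the existing TDB store and serve it at /dbpedia.
# --loc=DIR uses the database built above instead of an in-memory dataset.
fuseki-server --loc=/data/tdb/dbpedia /dbpedia

# Repeat on another port for each additional dataset, e.g.:
#   tdbloader --loc=/data/tdb/geonames geonames-dump.nt
#   fuseki-server --port=3031 --loc=/data/tdb/geonames /geonames
```

With one Fuseki instance per dataset, each on its own port, you then have the separate SPARQL endpoints your federation layer needs.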