Is there any way to speed up loading RDF files into Sesame? I have files ranging in size from a few MB to a couple of GB, in N-Triples format. I have tried the first three approaches in the Sesame Cookbook, but to no avail. Loading a ~700MB file took 17 hours, with the input file split at every 500,000th line (approach 2 in the cookbook). Sesame is running on a commodity machine with Windows 7.
Bonus part: I want to perform inference on the data, but store the inferred data in a separate Sesame repository (or alternatively in another context/graph in the same repository). Essentially I want to keep the data in two versions, one which is "regular" RDF and one which is optimized for certain queries - hence the need to store them separately. I have been looking at the CustomGraphQueryInferencer, but have not figured out whether I can use it to store the data separately. Furthermore, the CustomGraphQueryInferencer seems to slow down loading greatly, which makes it very unattractive. Any alternative solutions?
Inserting 500k triples in 17 hours is absurdly bad; that works out to about 8 triples/sec. Sesame, to my knowledge, does not have a bulk insert mode, but there is no way you should be seeing load rates that slow.
You might make sure you don't have autoCommit on; that would issue a commit for every single triple, which could go a long way toward explaining why your load rate is so uncommonly poor.
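For example, here's a minimal sketch of loading the file inside a single explicit transaction, assuming a local NativeStore and placeholder paths. On Sesame 2.7+ the begin()/commit() calls are the way to do this; on older versions, setAutoCommit(false) on the connection serves the same purpose.

```java
import java.io.File;

import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.rio.RDFFormat;
import org.openrdf.sail.nativerdf.NativeStore;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder data directory and input file.
        Repository repo = new SailRepository(new NativeStore(new File("/data/sesame")));
        repo.initialize();

        RepositoryConnection con = repo.getConnection();
        try {
            con.begin();                        // one transaction for the whole file
            con.add(new File("/data/dump.nt"),  // the N-Triples file to load
                    "http://example.org/",      // base URI (not used for N-Triples)
                    RDFFormat.NTRIPLES);
            con.commit();                       // single commit at the end
        }
        catch (Exception e) {
            con.rollback();
            throw e;
        }
        finally {
            con.close();
            repo.shutDown();
        }
    }
}
```

For multi-GB files you may still want to commit in chunks rather than in one giant transaction, but the point is the same: commit every few hundred thousand statements, not per statement.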
With respect to reasoning, another factor for the poor load rate is that you are using an inferencer that performs materialization. That is, each time you write to the database, inferred statements are (re)calculated and saved back into the database. Further, the inferencer you've chosen to use is based on queries, so your loads into the database are hampered by query answering, truth maintenance, and materialization.
That is probably a large part of the poor load rate, although even then it seems too slow; combined with autoCommit being enabled, though, the two together might explain it.
You might be able to add the inferencer after all the data is loaded. I don't know enough about how that particular inferencer works to say whether that will behave correctly, but in theory it's certainly possible; the Sesame mailing list may have more details about how it works.
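If it does work that way, the shape of it would be something like the sketch below: bulk-load into a plain NativeStore first, then re-open the same data directory with the inferencer stacked on top. The data directory, rule query, and matcher query are placeholders, and I'm assuming the CustomGraphQueryInferencer constructor that takes the base Sail plus the two queries; check the javadoc for your version, and whether it actually (re)computes inferences over data that was already in the store.

```java
import java.io.File;

import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.Repository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.inferencer.fc.CustomGraphQueryInferencer;
import org.openrdf.sail.nativerdf.NativeStore;

public class AddInferencerAfterLoad {
    public static void main(String[] args) throws Exception {
        // Re-open the data directory that was bulk-loaded without inferencing...
        NativeStore base = new NativeStore(new File("/data/sesame"));

        // ...and only now wrap it with the custom inferencer.
        // Placeholder rule: every ex:linkedTo statement implies an ex:related statement.
        String rule =
            "CONSTRUCT { ?s <http://example.org/related> ?o } " +
            "WHERE { ?s <http://example.org/linkedTo> ?o }";
        // Placeholder matcher: identifies the statements the rule can produce.
        String matcher =
            "CONSTRUCT { ?s <http://example.org/related> ?o } " +
            "WHERE { ?s <http://example.org/related> ?o }";

        CustomGraphQueryInferencer inferencer =
            new CustomGraphQueryInferencer(base, QueryLanguage.SPARQL, rule, matcher);

        Repository repo = new SailRepository(inferencer);
        repo.initialize();
        // ... query and update through repo from here on ...
        repo.shutDown();
    }
}
```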
You can also consider a solution that performs reasoning at query time rather than load time; this avoids the costly overhead on writes, and also lets you use, or not use, reasoning whenever it is most appropriate for your application. That would effectively give you your two 'versions' of the data, one with reasoning applied and one without, without actually having to keep two copies or materialize the inferences.
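Sesame's own bundled inferencers materialize at write time, so true query-time reasoning generally means a third-party reasoner; but whichever reasoner sits underneath, the per-query switch in the standard Sesame API is setIncludeInferred. A minimal sketch, assuming an already-initialized Repository whose Sail stack includes an inferencer and a placeholder query:

```java
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

public class InferenceToggle {
    static void runBothViews(Repository repo) throws Exception {
        String sparql = "SELECT ?s ?type WHERE { ?s a ?type }";
        RepositoryConnection con = repo.getConnection();
        try {
            // "Regular" view: asserted statements only.
            TupleQuery asserted = con.prepareTupleQuery(QueryLanguage.SPARQL, sparql);
            asserted.setIncludeInferred(false);
            print(asserted.evaluate());

            // Reasoned view: asserted plus inferred statements.
            TupleQuery reasoned = con.prepareTupleQuery(QueryLanguage.SPARQL, sparql);
            reasoned.setIncludeInferred(true);
            print(reasoned.evaluate());
        }
        finally {
            con.close();
        }
    }

    static void print(TupleQueryResult result) throws Exception {
        try {
            while (result.hasNext()) {
                System.out.println(result.next());
            }
        }
        finally {
            result.close();
        }
    }
}
```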