I have a large TDB dataset (cf. this post: Fuseki config for 2 datasets + text index : how to use turtle files?) and I need to extract data from it in order to build a "subgraph" and import it into Fuseki.
I found that OFFSET could be a solution to get all the results of a query when there are too many of them (about 12M triples).
Here are my questions :
1) I read in the W3C recommendation that OFFSET should be used with ORDER BY:
Using LIMIT and OFFSET (...) will not be useful unless the order is made predictable by using ORDER BY.
(cf. https://www.w3.org/TR/rdf-sparql-query/#modOffset )
-- Unfortunately, ORDER BY seems to take a very long time on my dataset. I found some examples of OFFSET without ORDER BY (here's one: Getting list of persons using SPARQL dbpedia), so I tried to use OFFSET alone, and it seems to work.
-- I need to be sure that if I repeat the same query with successive OFFSET values, I'll get all the results. So I tried it on a sample and checked that the results contain distinct values and the expected number of them; everything seems OK. So I assume that ORDER BY is needed only if the dataset is modified between 2 queries ("predictable order")?
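For reference, the kind of check I mean can be sketched as below. This is only a rough sketch, and it assumes the pages were saved as N-Triples files (one triple per line); the function names are made up for the example. A non-zero duplicate count is only a hint, not a proof that pages overlap, because a CONSTRUCT can legitimately produce the same triple from two different solutions.

```shell
#!/bin/sh
# Sketch: sanity checks on a set of exported page files.
# Assumption: every file is N-Triples, i.e. exactly one triple per line.

# count_triples FILE...: total number of lines (triples) across all pages
count_triples() {
  cat "$@" | wc -l | tr -d ' '
}

# count_duplicates FILE...: number of distinct lines that occur more than once
count_duplicates() {
  cat "$@" | sort | uniq -d | wc -l | tr -d ' '
}

# Usage (hypothetical file names):
#   count_triples page_*.nt      # compare with the expected total
#   count_duplicates page_*.nt   # 0 means no triple was exported twice
```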
2) Does the performance depend on the ratio limit/offset?
-- I tried LIMIT = 100, 1000, 5000, 10000 with the same offset, and the speed seems to be nearly the same.
-- I also compared different values for OFFSET, and it seems that the execution time is longer for a big offset (but maybe it's only a problem with TDB, cf. https://www.mail-archive.com/users@jena.apache.org/msg13806.html)
~~~~~~ more info ~~~~~~
-- I use a script with tdbquery and this command :
./tdbquery --loc=$DATASET --time --results=ttl "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET"
-- Dataset : ~168M triples, and ~12M triples with dcterms:title .
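To make the setup concrete, here is a minimal sketch of how such a script could loop over the pages. It assumes the total number of matching solutions is known in advance (about 12M here); TDBQUERY and OUTDIR are hypothetical variables added for the example, while DATASET and PREFIXES are the ones used in the command above. Without ORDER BY, nothing in the recommendation quoted earlier guarantees that the pages do not overlap.

```shell
#!/bin/sh
# Sketch: run the CONSTRUCT query page by page with LIMIT/OFFSET.

# page_offsets TOTAL LIMIT: print the OFFSET of each page, one per line
page_offsets() {
  offset=0
  while [ "$offset" -lt "$1" ]; do
    echo "$offset"
    offset=$((offset + $2))
  done
}

# export_pages TOTAL LIMIT: write one Turtle file per page
export_pages() {
  for offset in $(page_offsets "$1" "$2"); do
    "${TDBQUERY:-./tdbquery}" --loc="$DATASET" --results=ttl \
      "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $2 offset $offset" \
      > "${OUTDIR:-.}/page_${offset}.ttl"
  done
}

# Usage: export_pages 12000000 10000
```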
Thanks in advance
Thank you AKSW & Andy, your comments helped me to learn about Sparql.
So I tried to use SELECT REDUCED, but it takes a very long time and the process can't be stopped if I don't use OFFSET. Besides, I need to transform the results to produce a new graph (and I want to make other transformations on authors, etc).
I read some pages about streams, models, and serialization, and found that I could transform the data directly with several updates in the same query. Here is a potential solution : first make a copy of the TDB files, and then use this query in a while loop :
DELETE {
  ?manif dcterms:title ?titre ;
         rdarelationships:expressionManifested ?exp
}
INSERT {
  graph <http://titres_1> {
    ?manif rdarelationships:expressionManifested ?exp .
    ?exp dcterms:title ?titre
  }
}
WHERE {
  select * where {
    ?manif dcterms:title ?titre ;
           rdarelationships:expressionManifested ?exp
  }
  LIMIT 100000
}
This solution has several advantages :
Maybe something more efficient could be done : any idea would be appreciated.
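The while loop around that update can be sketched as follows. This is only a sketch under several assumptions that are not part of my actual setup: the copied dataset is served by Fuseki at a hypothetical ENDPOINT, the update text is stored in a variable UPDATE, and the loop stops when an ASK query finds no more matching triples. It only terminates if each batch actually removes the triples it matched.

```shell
#!/bin/sh
# Sketch: repeat a batched update until nothing is left to move.
# Assumptions: s-query and s-update (SOH scripts) are on the PATH,
# $PREFIXES holds the PREFIX declarations, $UPDATE holds the batched update.
ENDPOINT=${ENDPOINT:-http://localhost:3030/ds}   # hypothetical service URL

# has_remaining: succeeds while some triples still match the pattern
has_remaining() {
  s-query --service="$ENDPOINT/query" \
    "$PREFIXES ASK { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp }" \
    | grep -q true
}

# run_batch: apply one batch (at most LIMIT solutions) of the update
run_batch() {
  s-update --service="$ENDPOINT/update" "$PREFIXES $UPDATE"
}

# repeat_batches CHECK STEP: run STEP while CHECK succeeds, print the batch count
repeat_batches() {
  batches=0
  while "$1"; do
    "$2" || return 1   # stop at the first failing batch
    batches=$((batches + 1))
  done
  echo "$batches"
}

# Usage: repeat_batches has_remaining run_batch
```

Passing the check and the step as function names keeps the loop itself independent of the tool used, so the same loop could drive tdbupdate instead of s-update.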
----- EDIT ----------
I've begun to transform the data, using a bash script to repeat the query, and s-get ... | split to export the triples to .nt files. After each export, the "temp" graph is cleared with s-update.
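The split step can be sketched like this. It is a rough sketch with a made-up function name, and it assumes the input on stdin is N-Triples: cutting on line boundaries is only safe when each line is a complete triple, which is not the case for Turtle.

```shell
#!/bin/sh
# Sketch: cut an N-Triples stream into files of at most N lines each.

# split_triples PREFIX MAX_LINES: read N-Triples on stdin, write PREFIXaa.nt, PREFIXab.nt, ...
split_triples() {
  split -l "$2" - "$1"
  for f in "$1"*; do
    case "$f" in
      *.nt) continue ;;   # leave chunks from an earlier run untouched
    esac
    mv "$f" "$f.nt"
  done
}

# Usage: pipe the output of the s-get call into it, for example
#   ... | split_triples export/titres_ 1000000
```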
Everything seems to be ok, but
Therefore, 2 questions :
Thanks in advance
It seems that the file sizes and the performance depend on parameters that can be set in a tdb.cfg file : see http://jena.apache.org/documentation/tdb/store-parameters.html .
I didn't have any .cfg file in my dataset folder. The first test I made was to add one and change tdb.file_mode to 'direct' : the files no longer seem to grow as they did before. However, it uses more RAM and the queries are slower (even if I increase java -Xms and -Xmx). I think there's a tradeoff between file size and query performance. If I have time, I'll subscribe to the jena-users mailing list to ask what the best tuning is.
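For anyone who wants to repeat the test, this is roughly how the file can be created. It is a minimal sketch: the function name is made up, and it assumes the JSON layout described on the store-parameters page linked above, with tdb.file_mode as the only parameter set. The dataset has to be reopened (Fuseki or the tdb* command restarted) before the change can take effect.

```shell
#!/bin/sh
# Sketch: write a tdb.cfg that sets only the file mode.

# write_tdb_cfg DATASET_DIR MODE: MODE is "direct" or "mapped"
write_tdb_cfg() {
  cat > "$1/tdb.cfg" <<EOF
{
  "tdb.file_mode" : "$2"
}
EOF
}

# Usage: write_tdb_cfg "$DATASET" direct
```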
Conclusion : it was interesting to test the queries, but my dataset is too large; I'm going to build another one from the original XML files, either with named graphs (although tdbloader2 doesn't allow that) or as several smaller datasets.