I have a large TDB dataset (cf. this post: Fuseki config for 2 datasets + text index: how to use turtle files?) and I need to extract data from it in order to build a "subgraph" and import it into Fuseki.
I found that OFFSET could be a solution for getting all the results of a query when they are too numerous to fetch at once (about 12M triples).
Here are my questions:
1) I read in the W3C recommendation that OFFSET should be used with ORDER BY:
"Using LIMIT and OFFSET (...) will not be useful unless the order is made predictable by using ORDER BY."
(cf. https://www.w3.org/TR/rdf-sparql-query/#modOffset )
-- Unfortunately, ORDER BY is very slow on my dataset. I found some examples of OFFSET without ORDER BY (here's one: Getting list of persons using SPARQL dbpedia), so I tried OFFSET alone, and it seems to work.
-- I need to be sure that if I repeat the same query with increasing OFFSET values, I'll get all the results. I've tried it on a sample and checked that the results are distinct and that their number matches what I expected, so everything seems OK. Am I right to assume that ORDER BY is only needed if the dataset is modified between two queries (the "predictable order")?
2) Does the performance depend on the ratio between LIMIT and OFFSET?
-- I tried LIMIT = 100, 1000, 5000, 10000 with the same OFFSET, and the speed is nearly the same.
-- I also compared different values of OFFSET, and execution time seems to grow with larger offsets (but maybe that's just a TDB issue, cf. https://www.mail-archive.com/users@jena.apache.org/msg13806.html).
~~~~~~ more info ~~~~~~
-- I use a script with tdbquery and this command (a sketch of the full paging loop follows this block):
./tdbquery --loc=$DATASET --time --results=ttl "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET"
-- Dataset: ~168M triples in total, ~12M of them with a dcterms:title.
~~~~~~~~~~~~~~~~~~~~~~
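To make question 1) concrete, here is a rough sketch of the paging loop I have in mind (assuming tdbquery is in the current directory as above, and that an empty page means the end; the dataset path and the PREFIX declarations are placeholders to adapt):

#!/bin/bash
# Page through the CONSTRUCT query with a fixed LIMIT and a growing OFFSET.
DATASET=/path/to/tdb
LIMIT=100000
OFFSET=0
PREFIXES="PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>"

while true; do
  ./tdbquery --loc=$DATASET --results=ttl \
    "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET" \
    > page_$OFFSET.ttl
  # An empty page only contains @prefix lines and blank lines: stop there.
  if ! grep -q -v -e '^@prefix' -e '^[[:space:]]*$' page_$OFFSET.ttl; then
    rm page_$OFFSET.ttl
    break
  fi
  OFFSET=$((OFFSET + LIMIT))
done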
Thanks in advance
Thank you AKSW & Andy, your comments helped me learn more about SPARQL.
So I tried to use SELECT REDUCED, but it's very slow and the process can't be stopped if I don't use OFFSET. Besides, I need to transform the results to produce a new graph (and I want to apply other transformations, on authors, etc.).
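For reference, this is roughly the query I tried (a minimal sketch of what I understood by SELECT REDUCED, using the same prefixes as in the tdbquery command above):

SELECT REDUCED ?exp ?titre
WHERE {
  ?manif dcterms:title ?titre ;
         rdarelationships:expressionManifested ?exp
}

REDUCED allows (but does not require) the engine to drop duplicate rows, which is usually cheaper than DISTINCT, but on ~12M matches it still runs for a very long time here.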
I read some pages about streams, models, and serialization, and found that I could transform the data directly by combining the delete and the insert in the same update. Here is a potential solution: first make a copy of the TDB files, then run this update in a while loop:
DELETE {
  ?manif dcterms:title ?titre ;
         rdarelationships:expressionManifested ?exp
}
INSERT {
  GRAPH <http://titres_1> {
    ?manif rdarelationships:expressionManifested ?exp .
    ?exp dcterms:title ?titre
  }
}
WHERE {
  SELECT * WHERE {
    ?manif dcterms:title ?titre ;
           rdarelationships:expressionManifested ?exp
  }
  LIMIT 100000
}
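A rough sketch of the loop (assuming the update above is saved as move_titles.ru with its PREFIX declarations, and run against the copy of the TDB files; the count query is only a stopping condition and the paths are placeholders):

#!/bin/bash
DATASET=/path/to/tdb-copy
PREFIXES="PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>"

while true; do
  # Move one batch of at most 100000 titles into the named graph.
  ./tdbupdate --loc=$DATASET --update=move_titles.ru

  # Count what is left in the default graph; stop when nothing matches any more.
  REMAINING=$(./tdbquery --loc=$DATASET --results=csv \
    "$PREFIXES select (count(*) as ?c) where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp }" \
    | tail -n 1 | tr -d '\r')
  if [ "$REMAINING" -eq 0 ]; then
    break
  fi
done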
This solution has several advantages. Still, maybe something more efficient could be done: any idea would be appreciated.
----- EDIT ----------
I've begun to transform the data, using a bash script to repeat the query, and s-get ... | split to export the triples into .nt files. After each export, the "temp" graph is cleared with s-update.
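For what it's worth, the export step looks roughly like this (a sketch: the endpoint /ds, the graph name and the split size are placeholders, and I assume s-get returns one triple per line so that split doesn't cut a triple in half):

#!/bin/bash
ENDPOINT=http://localhost:3030/ds      # placeholder: Fuseki dataset
GRAPH=http://titres_1                  # the "temp" graph filled by the update

# Export the temp graph and split it into files of 500000 triples each.
s-get $ENDPOINT/data $GRAPH | split -l 500000 - titres_1_part_

# Clear the temp graph before running the next batch of updates.
s-update --service=$ENDPOINT/update "CLEAR GRAPH <$GRAPH>"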
Everything seems to be OK, but the TDB files keep growing on disk as the updates run.
Therefore, 2 questions:
1) Why do the TDB files keep growing?
2) Is there a way to limit this growth, or a more efficient way to proceed?
Thanks in advance
LAST EDIT
It seems that the file sizes and the performance depend on parameters that can be set in a tdb.cfg file: see http://jena.apache.org/documentation/tdb/store-parameters.html .
I didn't have any .cfg file in my dataset folder. As a first test I added one and changed tdb.file_mode to 'direct': the size of the files no longer seems to grow as before. However, it costs more RAM and queries are slower (even if I increase java -Xms and -Xmx). I think there's a trade-off between file size and query performance. If I have time, I'll subscribe to the jena-users mailing list to ask about the best tuning.
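For the record, this is the kind of tdb.cfg I put in the database directory (a minimal sketch based on my reading of the store-parameters page, which describes the file as JSON; tdb.file_mode is the only setting I actually changed):

{
  "tdb.file_mode": "direct"
}

Other parameters (block cache and node cache sizes) can be added in the same file, as described on that page.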
Conclusion: it was interesting to test these queries, but my dataset is too large; I'm going to build another one from the original XML files, either with named graphs (although tdbloader2 doesn't allow loading into named graphs) or as several smaller datasets.