I have a large TDB dataset (cf. this post: Fuseki config for 2 datasets + text index: how to use turtle files?) and I need to extract data from it in order to build a "subgraph" and import it into Fuseki.
I found that OFFSET could be a solution for getting all the results of a query when they are too numerous to fetch at once (about 12M triples).
Here are my questions:
1) I read in the W3C recommendation that OFFSET should be used with ORDER BY:
"Using LIMIT and OFFSET (...) will not be useful unless the order is made predictable by using ORDER BY."
(cf. https://www.w3.org/TR/rdf-sparql-query/#modOffset )
-- Unfortunately, ORDER BY is very slow on my dataset. I found some examples of OFFSET without ORDER BY (here's one: Getting list of persons using SPARQL dbpedia), so I tried OFFSET alone, and it seems to work.
-- I need to be sure that if I repeat the same query with increasing OFFSET values, I'll get all the results. I've tried it on a sample and checked that the results are distinct and that their number matches what I expected, so everything seems OK. Am I right to assume that ORDER BY is only needed if the dataset is modified between two queries (the "predictable order")?
2) Does the performance depend on the ratio between LIMIT and OFFSET?
-- I tried LIMIT = 100, 1000, 5000, 10000 with the same OFFSET, and the speed is nearly the same.
-- I also compared different values of OFFSET, and execution time seems to grow with larger offsets (but maybe that's just a TDB issue, cf. https://www.mail-archive.com/users@jena.apache.org/msg13806.html).
~~~~~~ more info ~~~~~~
-- I use a script with tdbquery and this command (a sketch of the full paging loop follows this block):
./tdbquery --loc=$DATASET --time --results=ttl "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET"
-- Dataset: ~168M triples in total, ~12M of them with a dcterms:title.
~~~~~~~~~~~~~~~~~~~~~~
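To make question 1) concrete, here is a rough sketch of the paging loop I have in mind (assuming tdbquery is in the current directory as above, and that an empty page means the end; the dataset path and the PREFIX declarations are placeholders to adapt):

#!/bin/bash
# Page through the CONSTRUCT query with a fixed LIMIT and a growing OFFSET.
DATASET=/path/to/tdb
LIMIT=100000
OFFSET=0
PREFIXES="PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>"

while true; do
  ./tdbquery --loc=$DATASET --results=ttl \
    "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET" \
    > page_$OFFSET.ttl
  # An empty page only contains @prefix lines and blank lines: stop there.
  if ! grep -q -v -e '^@prefix' -e '^[[:space:]]*$' page_$OFFSET.ttl; then
    rm page_$OFFSET.ttl
    break
  fi
  OFFSET=$((OFFSET + LIMIT))
done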
Thanks in advance
Thank you AKSW & Andy, your comments helped me learn more about SPARQL.
So I tried to use SELECT REDUCED, but it's very slow and the process can't be stopped if I don't use OFFSET. Besides, I need to transform the results to produce a new graph (and I want to apply other transformations, on authors, etc.).
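For reference, this is roughly the query I tried (a minimal sketch of what I understood by SELECT REDUCED, using the same prefixes as in the tdbquery command above):

SELECT REDUCED ?exp ?titre
WHERE {
  ?manif dcterms:title ?titre ;
         rdarelationships:expressionManifested ?exp
}

REDUCED allows (but does not require) the engine to drop duplicate rows, which is usually cheaper than DISTINCT, but on ~12M matches it still runs for a very long time here.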
I read some pages about streams, models, and serialization, and found that I could transform the data directly by combining the delete and the insert in the same update. Here is a potential solution: first make a copy of the TDB files, then run this update in a while loop:
DELETE {
  ?manif dcterms:title ?titre ;
         rdarelationships:expressionManifested ?exp
}
INSERT {
  GRAPH <http://titres_1> {
    ?manif rdarelationships:expressionManifested ?exp .
    ?exp dcterms:title ?titre
  }
}
WHERE {
  SELECT * WHERE {
    ?manif dcterms:title ?titre ;
           rdarelationships:expressionManifested ?exp
  }
  LIMIT 100000
}
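A rough sketch of the loop (assuming the update above is saved as move_titles.ru with its PREFIX declarations, and run against the copy of the TDB files; the count query is only a stopping condition and the paths are placeholders):

#!/bin/bash
DATASET=/path/to/tdb-copy
PREFIXES="PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>"

while true; do
  # Move one batch of at most 100000 titles into the named graph.
  ./tdbupdate --loc=$DATASET --update=move_titles.ru

  # Count what is left in the default graph; stop when nothing matches any more.
  REMAINING=$(./tdbquery --loc=$DATASET --results=csv \
    "$PREFIXES select (count(*) as ?c) where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp }" \
    | tail -n 1 | tr -d '\r')
  if [ "$REMAINING" -eq 0 ]; then
    break
  fi
done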
This solution has several advantages. Still, maybe something more efficient could be done: any idea would be appreciated.
----- EDIT ----------
I've begun to transform the data, using a bash script to repeat the query, and s-get ... | split to export the triples into .nt files. After each export, the "temp" graph is cleared with s-update.
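For what it's worth, the export step looks roughly like this (a sketch: the endpoint /ds, the graph name and the split size are placeholders, and I assume s-get returns one triple per line so that split doesn't cut a triple in half):

#!/bin/bash
ENDPOINT=http://localhost:3030/ds      # placeholder: Fuseki dataset
GRAPH=http://titres_1                  # the "temp" graph filled by the update

# Export the temp graph and split it into files of 500000 triples each.
s-get $ENDPOINT/data $GRAPH | split -l 500000 - titres_1_part_

# Clear the temp graph before running the next batch of updates.
s-update --service=$ENDPOINT/update "CLEAR GRAPH <$GRAPH>"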
Everything seems to be OK, but the TDB files keep growing on disk as the updates run.
Therefore, 2 questions:
1) Why do the TDB files keep growing?
2) Is there a way to limit this growth, or a more efficient way to proceed?
Thanks in advance
LAST EDIT
It seems that the file sizes and the performance depend on parameters that can be set in a tdb.cfg file: see http://jena.apache.org/documentation/tdb/store-parameters.html .
I didn't have any .cfg file in my dataset folder. As a first test I added one and changed tdb.file_mode to 'direct': the size of the files no longer seems to grow as before. However, it costs more RAM and queries are slower (even if I increase java -Xms and -Xmx). I think there's a trade-off between file size and query performance. If I have time, I'll subscribe to the jena-users mailing list to ask about the best tuning.
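For the record, this is the kind of tdb.cfg I put in the database directory (a minimal sketch based on my reading of the store-parameters page, which describes the file as JSON; tdb.file_mode is the only setting I actually changed):

{
  "tdb.file_mode": "direct"
}

Other parameters (block cache and node cache sizes) can be added in the same file, as described on that page.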
Conclusion: it was interesting to test these queries, but my dataset is too large; I'm going to build another one from the original XML files, either with named graphs (although tdbloader2 doesn't allow loading into named graphs) or as several smaller datasets.