Search code examples
dbpediagraphdb

Stuck loading large dataset with GraphDB


When I load this DBpedia (2015-10, en, ~1 billion triples ) into GraphDB 9.1.1 the CPU load drops to 0% after around 13M triples and idles henceforth. The process does not terminate until I kill it manually.

The machine has enough disc space and sufficient more RAM than the 512GB assigned via the Xmx CMD option to java.

The file that I tried to load is provided here: https://hobbitdata.informatik.uni-leipzig.de/dbpedia_2015-10_en_wo-comments_c.nt.zst

It can be decompressed with:

zstd -d "dbpedia_2015-10_en_wo-comments_c.nt.zst" -o "dbpedia_2015-10_en_wo-comments_c.nt"

I use the following command to load the data:

java -Xmx512G -cp "$HOME/graphdb/graphdb-free-9.1.1/lib/*" -Dgraphdb.dist=$HOME/graphdb/graphdb-free-9.1.1 -Dgraphdb.home.data=$HOME/dbpedia2015/data/ -Djdk.xml.entityExpansionLimit=0 com.ontotext.graphdb.loadrdf.LoadRDF -f -m parallel -p -c $HOME/graphdb/graphdb-dbpedia2015.ttl $HOME/dbpedia_2015-10_en_wo-comments_c.nt

$HOME/graphdb/graphdb-dbpedia2015.ttl looks like:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "dbpedia2015" ;
    rdfs:label "Repository for dataset dbpedia2015" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;

                        # ruleset to use
                        owlim:ruleset "rdfsplus-optimized" ;

                        # disable context index(because my data do not uses contexts)
                        owlim:enable-context-index "false" ;

                        # indexes to speed up the read queries
                        owlim:enablePredicateList "true" ;
                        owlim:enable-literal-index "true" ;
                        owlim:in-memory-literal-properties "true" ;
        ]
    ].

The log of the output is:

16:11:07.438 [main] INFO  com.ontotext.graphdb.loadrdf.Params - MODE: parallel
16:11:07.439 [main] INFO  com.ontotext.graphdb.loadrdf.Params - STOP ON FIRST ERROR: false
16:11:07.439 [main] INFO  com.ontotext.graphdb.loadrdf.Params - PARTIAL LOAD: true
16:11:07.439 [main] INFO  com.ontotext.graphdb.loadrdf.Params - CONFIG FILE: /home/me/graphdb-dbpedia2015.ttl
16:11:07.444 [main] INFO  com.ontotext.graphdb.loadrdf.LoadRDF - Attaching to location: /home/me/graphdb/dbpedia2015/data
16:11:07.618 [main] INFO  c.o.t.u.l.LimitedObjectCacheFactory - Using LRU cache type: synch
16:11:08.025 [main] WARN  com.ontotext.plugin.literals-index - Rebuilding literals indexes. Starting from id:1
16:11:08.029 [main] WARN  com.ontotext.plugin.literals-index - Complete in 0.004, num entries indexed:0
16:11:08.780 [main] INFO  c.o.rio.parallel.ParallelLoader - Data will be parsed + resolved + loaded.
16:11:08.788 [main] INFO  c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:09.984 [main] INFO  com.ontotext.graphdb.loadrdf.LoadRDF - Loading file: dbpedia_2015-10_en_wo-comments_c.nt
16:11:09.991 [main] INFO  c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:19.987 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 2,111,690 stmts. Rate: 211,147 st/s. Statements overall: 2,111,690. Global average rate: 211,000 st/s. Now: Tue Mar 10 16:11:19 UTC 2020. Total memory: 22144M, Free memory: 4890M, Max memory: 524288M.
16:11:30.515 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 3,955,363 stmts. Rate: 192,662 st/s. Statements overall: 3,955,363. Global average rate: 192,596 st/s. Now: Tue Mar 10 16:11:30 UTC 2020. Total memory: 66432M, Free memory: 53925M, Max memory: 524288M.
16:11:40.515 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 6,889,662 stmts. Rate: 225,661 st/s. Statements overall: 6,889,662. Global average rate: 225,609 st/s. Now: Tue Mar 10 16:11:40 UTC 2020. Total memory: 199296M, Free memory: 177241M, Max memory: 524288M.
16:11:51.185 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 9,124,978 stmts. Rate: 221,474 st/s. Statements overall: 9,124,978. Global average rate: 221,437 st/s. Now: Tue Mar 10 16:11:51 UTC 2020. Total memory: 199296M, Free memory: 185106M, Max memory: 524288M.
16:12:02.877 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 11,083,153 stmts. Rate: 209,539 st/s. Statements overall: 11,083,153. Global average rate: 209,511 st/s. Now: Tue Mar 10 16:12:02 UTC 2020. Total memory: 199296M, Free memory: 184331M, Max memory: 524288M.
16:12:15.800 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 13,166,352 stmts. Rate: 200,047 st/s. Statements overall: 13,166,352. Global average rate: 200,026 st/s. Now: Tue Mar 10 16:12:15 UTC 2020. Total memory: 329312M, Free memory: 313496M, Max memory: 524288M.

Any idea why it is stuck after around 13M triples?


Solution

  • First - assign less Xmx to the process (around 38-42 GB would be enough). The database will need additional memory for the off heap so be sure to not assign all of your memory. If you still cannot load the dataset could you please send the jstack of the process or you can use the Java Flight Records if you use Oracle JDK:

    jcmd <pid> VM.unlock_commercial_features
    jcmd <pid> JFR.start duration=60s name=production filename=production.jfr settings=profile
    

    Set the duration to a value, which would allow the trace of the execution. You can send the results to support@ontotext.com as it will contain information about your environment.

    Another alternative is to use the Preload tool - it's purpose is loading large datasets - http://graphdb.ontotext.com/documentation/enterprise/loading-data-using-preload.html