When I load DBpedia (2015-10, en, ~1 billion triples) into GraphDB 9.1.1, the CPU load drops to 0% after around 13M triples and the process idles from then on. It does not terminate until I kill it manually.
The machine has enough disk space and considerably more RAM than the 512 GB assigned to Java via the -Xmx command-line option.
The file that I tried to load is provided here: https://hobbitdata.informatik.uni-leipzig.de/dbpedia_2015-10_en_wo-comments_c.nt.zst
It can be decompressed with:
zstd -d "dbpedia_2015-10_en_wo-comments_c.nt.zst" -o "dbpedia_2015-10_en_wo-comments_c.nt"
I use the following command to load the data:
java -Xmx512G -cp "$HOME/graphdb/graphdb-free-9.1.1/lib/*" -Dgraphdb.dist=$HOME/graphdb/graphdb-free-9.1.1 -Dgraphdb.home.data=$HOME/dbpedia2015/data/ -Djdk.xml.entityExpansionLimit=0 com.ontotext.graphdb.loadrdf.LoadRDF -f -m parallel -p -c $HOME/graphdb/graphdb-dbpedia2015.ttl $HOME/dbpedia_2015-10_en_wo-comments_c.nt
The configuration file $HOME/graphdb/graphdb-dbpedia2015.ttl looks like this:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.
[] a rep:Repository ;
    rep:repositoryID "dbpedia2015" ;
    rdfs:label "Repository for dataset dbpedia2015" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;
            # ruleset to use
            owlim:ruleset "rdfsplus-optimized" ;
            # disable the context index (because my data does not use contexts)
            owlim:enable-context-index "false" ;
            # indexes to speed up read queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ].
The log of the output is:
16:11:07.438 [main] INFO com.ontotext.graphdb.loadrdf.Params - MODE: parallel
16:11:07.439 [main] INFO com.ontotext.graphdb.loadrdf.Params - STOP ON FIRST ERROR: false
16:11:07.439 [main] INFO com.ontotext.graphdb.loadrdf.Params - PARTIAL LOAD: true
16:11:07.439 [main] INFO com.ontotext.graphdb.loadrdf.Params - CONFIG FILE: /home/me/graphdb-dbpedia2015.ttl
16:11:07.444 [main] INFO com.ontotext.graphdb.loadrdf.LoadRDF - Attaching to location: /home/me/graphdb/dbpedia2015/data
16:11:07.618 [main] INFO c.o.t.u.l.LimitedObjectCacheFactory - Using LRU cache type: synch
16:11:08.025 [main] WARN com.ontotext.plugin.literals-index - Rebuilding literals indexes. Starting from id:1
16:11:08.029 [main] WARN com.ontotext.plugin.literals-index - Complete in 0.004, num entries indexed:0
16:11:08.780 [main] INFO c.o.rio.parallel.ParallelLoader - Data will be parsed + resolved + loaded.
16:11:08.788 [main] INFO c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:09.984 [main] INFO com.ontotext.graphdb.loadrdf.LoadRDF - Loading file: dbpedia_2015-10_en_wo-comments_c.nt
16:11:09.991 [main] INFO c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:19.987 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 2,111,690 stmts. Rate: 211,147 st/s. Statements overall: 2,111,690. Global average rate: 211,000 st/s. Now: Tue Mar 10 16:11:19 UTC 2020. Total memory: 22144M, Free memory: 4890M, Max memory: 524288M.
16:11:30.515 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 3,955,363 stmts. Rate: 192,662 st/s. Statements overall: 3,955,363. Global average rate: 192,596 st/s. Now: Tue Mar 10 16:11:30 UTC 2020. Total memory: 66432M, Free memory: 53925M, Max memory: 524288M.
16:11:40.515 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 6,889,662 stmts. Rate: 225,661 st/s. Statements overall: 6,889,662. Global average rate: 225,609 st/s. Now: Tue Mar 10 16:11:40 UTC 2020. Total memory: 199296M, Free memory: 177241M, Max memory: 524288M.
16:11:51.185 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 9,124,978 stmts. Rate: 221,474 st/s. Statements overall: 9,124,978. Global average rate: 221,437 st/s. Now: Tue Mar 10 16:11:51 UTC 2020. Total memory: 199296M, Free memory: 185106M, Max memory: 524288M.
16:12:02.877 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 11,083,153 stmts. Rate: 209,539 st/s. Statements overall: 11,083,153. Global average rate: 209,511 st/s. Now: Tue Mar 10 16:12:02 UTC 2020. Total memory: 199296M, Free memory: 184331M, Max memory: 524288M.
16:12:15.800 [main] INFO c.o.rio.parallel.ParallelRDFInserter - Parsed 13,166,352 stmts. Rate: 200,047 st/s. Statements overall: 13,166,352. Global average rate: 200,026 st/s. Now: Tue Mar 10 16:12:15 UTC 2020. Total memory: 329312M, Free memory: 313496M, Max memory: 524288M.
Any idea why it is stuck after around 13M triples?
First, assign a smaller heap (-Xmx) to the process; around 38-42 GB should be enough. The database needs additional off-heap memory, so be sure not to assign all of your RAM to the heap. If you still cannot load the dataset, please send a jstack thread dump of the process or, if you use Oracle JDK, a Java Flight Recorder recording:
jcmd <pid> VM.unlock_commercial_features
jcmd <pid> JFR.start duration=60s name=production filename=production.jfr settings=profile
Set the duration to a value long enough to capture a trace of the stalled execution. You can send the results to support@ontotext.com, as they will contain information about your environment.
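For the jstack route mentioned above, a minimal sketch could look like the following. It takes several dumps a few seconds apart, so that a thread sitting in the same stack frame in every dump stands out. The function name `dump_threads`, the output file names, and the defaults (3 dumps, 10 seconds apart) are assumptions chosen for illustration, not something GraphDB prescribes. `jstack` has to be run as the same user as the Java process.

```shell
# Write COUNT thread dumps of the JVM with the given PID, INTERVAL seconds apart.
# Usage: dump_threads PID [COUNT] [INTERVAL] [OUTDIR]
dump_threads() {
  pid=$1              # PID of the LoadRDF JVM
  count=${2:-3}       # number of dumps (assumed default)
  interval=${3:-10}   # seconds between dumps (assumed default)
  outdir=${4:-.}      # directory for the dump files
  i=1
  while [ "$i" -le "$count" ]; do
    # -l adds lock information, which helps to spot threads waiting on each other
    jstack -l "$pid" > "$outdir/loadrdf-jstack-$i.txt" || return 1
    [ "$i" -lt "$count" ] && sleep "$interval"
    i=$((i + 1))
  done
}

# Example (not executed here): locate the loader by its main class and dump it.
# dump_threads "$(pgrep -f com.ontotext.graphdb.loadrdf.LoadRDF | head -n 1)"
```

The resulting loadrdf-jstack-*.txt files can then be attached to the support request.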
Another alternative is the Preload tool, whose purpose is loading large datasets: http://graphdb.ontotext.com/documentation/enterprise/loading-data-using-preload.html
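As for the first suggestion (a smaller heap), re-running the original LoadRDF command could be sketched as below. All paths and options are copied from the question; only the heap size changes. The value 40 is simply an assumed pick from the suggested 38-42 GB range, and the `loadrdf` wrapper function and the `HEAP_GB` variable are hypothetical helpers introduced here for the dry run.

```shell
# Heap size in GB. 40 is an assumed value from the suggested 38-42 GB range;
# the remaining RAM stays available for GraphDB's off-heap memory.
HEAP_GB="${HEAP_GB:-40}"
GRAPHDB_DIST="$HOME/graphdb/graphdb-free-9.1.1"

# "$@" is an optional prefix command: `loadrdf echo` prints the command line
# (dry run), plain `loadrdf` executes it.
loadrdf() {
  "$@" java "-Xmx${HEAP_GB}G" \
    -cp "$GRAPHDB_DIST/lib/*" \
    "-Dgraphdb.dist=$GRAPHDB_DIST" \
    "-Dgraphdb.home.data=$HOME/dbpedia2015/data/" \
    -Djdk.xml.entityExpansionLimit=0 \
    com.ontotext.graphdb.loadrdf.LoadRDF \
    -f -m parallel -p \
    -c "$HOME/graphdb/graphdb-dbpedia2015.ttl" \
    "$HOME/dbpedia_2015-10_en_wo-comments_c.nt"
}

loadrdf echo   # review the command first
# loadrdf      # then start the actual load
```

A different heap size can be tried without editing the script, for example by running it with `HEAP_GB=42` set in the environment.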