Tags: sparql, rdf, owl, graphdb

Can GraphDB load 10 million statements with OWL reasoning?


I am struggling to load most of the Drug Ontology OWL files and most of the ChEBI OWL files into a GraphDB Free v8.3 repository with Optimized OWL Horst reasoning enabled.

Is this possible? Should I do something other than "be patient"?

Details:

I'm using the loadrdf offline bulk loader to populate an AWS r4.16xlarge instance with 488 GiB of RAM and 64 vCPUs.

Over the weekend, I played around with different pool buffer sizes and found that most of these files individually load fastest with a pool buffer of 2,000 or 20,000 statements instead of the suggested 200,000. I also added -Xmx470g to the loadrdf script. Most of the OWL files would load individually in less than one hour.

Around 10 pm EDT last night, I started to load all of the files listed below simultaneously. Now it's 11 hours later, and there are still millions of statements to go. The load rate is around 70/second now. It appears that only 30% of my RAM is being used, but the CPU load is consistently around 60.

  • Are there websites that document other people doing something at this scale?
  • Should I be using a different reasoning configuration? I chose this one because it was the fastest-loading OWL configuration in my experiments over the weekend. I think I will need to look for relationships that go beyond rdfs:subClassOf.

Files I'm trying to load:

+-------------+------------+---------------------+
|    bytes    | statements |        file         |
+-------------+------------+---------------------+
| 471,265,716 | 4,268,532  | chebi.owl           |
| 61,529      | 451        | chebi-disjoints.owl |
| 82,449      | 1,076      | chebi-proteins.owl  |
| 10,237,338  | 135,369    | dron-chebi.owl      |
| 2,374       | 16         | dron-full.owl       |
| 170,896     | 2,257      | dron-hand.owl       |
| 140,434,070 | 1,986,609  | dron-ingredient.owl |
| 2,391       | 16         | dron-lite.owl       |
| 234,853,064 | 2,495,144  | dron-ndc.owl        |
| 4,970       | 28         | dron-pro.owl        |
| 37,198,480  | 301,031    | dron-rxnorm.owl     |
| 137,507     | 1,228      | dron-upper.owl      |
+-------------+------------+---------------------+
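
For a sense of scale, the statement counts in the table can be totalled and compared with the observed load rate. This is only a rough sketch: it counts the explicit statements listed per file (inferred statements come on top), and it treats the 70 statements/second figure as constant, which it may not be.

```python
# Explicit statement counts copied from the table above.
STATEMENTS = {
    "chebi.owl": 4_268_532,
    "chebi-disjoints.owl": 451,
    "chebi-proteins.owl": 1_076,
    "dron-chebi.owl": 135_369,
    "dron-full.owl": 16,
    "dron-hand.owl": 2_257,
    "dron-ingredient.owl": 1_986_609,
    "dron-lite.owl": 16,
    "dron-ndc.owl": 2_495_144,
    "dron-pro.owl": 28,
    "dron-rxnorm.owl": 301_031,
    "dron-upper.owl": 1_228,
}


def total_statements(counts):
    """Sum the explicit statements across all files."""
    return sum(counts.values())


def hours_to_load(statements, rate_per_second):
    """Hours needed to load `statements` at a constant rate."""
    return statements / rate_per_second / 3600


total = total_statements(STATEMENTS)
print(f"total explicit statements: {total:,}")
print(f"hours per million statements at 70/s: {hours_to_load(1_000_000, 70):.1f}")
```

The total comes to 9,191,757 explicit statements, and at 70 statements/second each remaining million takes roughly four hours.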

Solution

  • @MarkMiller you can take a look at the Preload tool, which is part of the GraphDB 8.4.0 release. It is designed specifically to load large amounts of data at a constant speed. Note that it works without inference, so you'll need to load your data first, then change the ruleset and reinfer the statements.

    http://graphdb.ontotext.com/documentation/free/loading-data-using-preload.html
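
The "change the ruleset and reinfer" step could be scripted roughly as follows, once the preloaded repository is running. This is a sketch under assumptions, not a tested recipe: it assumes the `sys:addRuleset`, `sys:defaultRuleset` and `sys:reinfer` system predicates and the `owl-horst-optimized` ruleset name as described in the GraphDB reasoning documentation, and it assumes the repository accepts SPARQL updates through the standard RDF4J `/repositories/<id>/statements` endpoint. The function names, base URL and repository ID are hypothetical; check all of them against the documentation for your GraphDB version.

```python
import urllib.parse
import urllib.request

# GraphDB system namespace (assumed from the GraphDB documentation).
SYS = "http://www.ontotext.com/owlim/system#"


def ruleset_updates(ruleset):
    """Build the SPARQL updates that add a ruleset, make it the default, and reinfer."""
    prefix = f"PREFIX sys: <{SYS}>\n"
    return [
        prefix + f'INSERT DATA {{ _:b sys:addRuleset "{ruleset}" }}',
        prefix + f'INSERT DATA {{ _:b sys:defaultRuleset "{ruleset}" }}',
        prefix + "INSERT DATA { [] sys:reinfer [] }",
    ]


def run_update(base_url, repository, update):
    """POST one SPARQL update as a form-encoded `update` parameter."""
    url = f"{base_url}/repositories/{repository}/statements"
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


def switch_ruleset(base_url, repository, ruleset="owl-horst-optimized"):
    """Run the three updates in order against a live repository."""
    for update in ruleset_updates(ruleset):
        run_update(base_url, repository, update)
```

Calling something like `switch_ruleset("http://localhost:7200", "my-repo")` would then apply the ruleset; both arguments are placeholders. Reinference recomputes the inferred statements for the whole repository, so it may itself take a long time on a dataset of this size.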