Tags: java, memory, import, heap-memory, orientdb

OutOfMemoryError when importing GraphML into OrientDB


I am importing a moderate-sized GraphML file (around 8 GB) into OrientDB and I keep getting the following error:

Importing GRAPHML database from database /root/neo/out.graphml...
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

I have tried connecting to my database both locally and as a remote database (remote:localhost), to no avail. To be fair, connecting remotely helps, but not sufficiently. I have also tried tweaking the heap size (upped to 2048 MB) for both the console application and the database server itself. This also helped, but not sufficiently, and it is not clear to me which change helped exactly.
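For reference, these are the two places where I raised the heap; the exact variable names and defaults differ between OrientDB versions, so treat this as a rough sketch of my setup rather than the scripts' exact contents:

    # $orientdb/bin/console.sh -- heap for the console process
    JAVA_OPTS="-Xmx2048m"

    # $orientdb/bin/server.sh -- heap for the server process
    # (in my copy the memory options are read from ORIENTDB_OPTS_MEMORY)
    ORIENTDB_OPTS_MEMORY="-Xms2048m -Xmx2048m"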

I am wondering which part of the import process needs so much heap memory here, given that OrientDB itself does not use heap memory for regular database operations. Which side (the console that loads the GraphML, or the server that stores the imported records) needs more heap memory, and what is an optimal way of allocating Java heap in this case? And finally: is there a difference in memory requirements between connecting to a database locally and connecting to the same local database on the same machine remotely (remote:localhost)?


Solution

  • After a good day of trial and error I got it working. Here is what I tried and what worked:

    OrientDB's console loads the whole GraphML file into Java heap memory before importing it into the database, so it needs a maximum heap size at least as large as your GraphML file. The solution is to set the maximum heap size for the console in $orientdb/bin/console.sh. In my case this meant adding JAVA_OPTS="-Xmx8192m" at line 43 of the script.
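    As a sketch, the edit looks roughly like this (the surrounding contents and the exact line number vary between OrientDB versions, so check your own copy of the script):

        # $orientdb/bin/console.sh (excerpt)
        # Give the console at least as much heap as the GraphML file is large:
        JAVA_OPTS="-Xmx8192m"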

    OrientDB's console is not good at parallel processing. Although database operations are mostly IO-bound, this turns out to be a limiting factor when importing graph data. The solution is to connect to OrientDB remotely rather than natively when using the console. In concrete terms, instead of the suggested create database plocal:/tmp/db/test you may want to run the following command: create database remote:localhost/test USERNAME PASSWORD plocal.
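    Put together, a minimal console session looks roughly like this (assuming the server is already running; USERNAME and PASSWORD stand in for your server credentials, and the file path is the one from the question):

        $ $orientdb/bin/console.sh
        orientdb> create database remote:localhost/test USERNAME PASSWORD plocal
        orientdb> import database /root/neo/out.graphml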

    It took more than 48 hours to import the whole seven gigabytes of data (more than 4 million vertices and more than 37 million edges). Vertices were imported fairly quickly, while edges were imported at a rate of about 1,000 records per second (8 cores, SSD).

    Here is a write-up of the whole process.