
How can I optimize importing a large graph with one hundred thousand vertices and half a million edges into OrientDB using pyorient?


OrientDB: 2.1.3
pyorient: 1.4.7

I need to import a graph with one hundred thousand vertices and half a million edges into OrientDB using pyorient.

db.command one by one

At first I simply used db.command("create vertex V set a=1") to insert all the vertices and edges one by one.

But this took about two hours, so I started looking for a way to speed the process up.
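The one-by-one approach above can be sketched as follows. This is a minimal illustration, not the exact code from the question: `client` stands for an opened pyorient.OrientDB connection, and the vertex class V and its properties are placeholders for your schema.

```python
def insert_one_by_one(client, records):
    """Insert each record with its own `create vertex` command.

    `client` is assumed to be an opened pyorient.OrientDB connection.
    Every record costs one client.command() call -- i.e. one network
    round-trip -- which is why this approach is so slow at scale.
    """
    for props in records:
        fields = ", ".join("%s = %r" % (k, v) for k, v in sorted(props.items()))
        client.command("create vertex V set " + fields)
```

With 100,000 vertices and 500,000 edges, that is 600,000 round-trips, which matches the roughly two-hour runtime observed.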

Massive Insert?

Then I found that OrientDB supports a Massive Insert intent, but unfortunately the author of pyorient, in the issue "massive insertion: no transactions?", mentioned that

in the binary protocol ( and in pyorient of course ) there is not the massive insert intent.

SQL batch

pyorient supports SQL batch. Maybe this is an opportunity!

I put all the insert commands together and ran them with db.batch().
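The batching idea can be sketched like this. Again the names are illustrative: `client` is assumed to be an opened pyorient.OrientDB connection, and the chunk size is an arbitrary choice, not a recommended value.

```python
def insert_with_sql_batch(client, records, chunk=500):
    """Group many `create vertex` statements into one SQL batch script
    per chunk and send each script with a single db.batch() call.

    `client` is assumed to be an opened pyorient.OrientDB connection.
    Statements inside a batch script are separated by semicolons.
    """
    for i in range(0, len(records), chunk):
        script = ";".join(
            "create vertex V set "
            + ", ".join("%s = %r" % (k, v) for k, v in sorted(props.items()))
            for props in records[i:i + chunk]
        )
        client.batch(script)
```

This cuts the number of network round-trips from one per record to one per chunk, although (as the timings below show) the server-side cost of parsing and executing the batch script can outweigh that saving.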

As an example, I benchmarked a graph with 5,000 vertices and 20,000 edges:

  • SQL batch

    vertices: 25.1708816278 s
    edges: 254.248636227 s
    
  • one by one (original)

    vertices: 19.5094766904 s
    edges: 147.627924276 s
    

So the SQL batch approach actually takes much more time.

So I would like to know whether there is a faster way to do this import.

Thanks.


Solution

  • When inserting one by one, have you already tried whether you get better performance using a Transactional Graph and committing every X items? That is usually the correct way to insert a lot of data. Unfortunately, as you noted yourself, the Massive Insert intent cannot be used from pyorient. Multi-process approaches cannot help either: the driver holds a single socket connection and does not implement a connection pool, so all your concurrent requests are serialized (as in a pipeline) and the performance advantage of multiprocessing is lost.
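One way to apply the "commit every X items" advice from pyorient is to wrap each chunk of inserts in begin/commit inside an SQL batch script, so the server executes the chunk as a single transaction. This is a sketch under assumptions, not the answerer's exact method: `client` stands for an opened pyorient.OrientDB connection, and the chunk size and retry count are illustrative values to tune for your data.

```python
def insert_in_transactions(client, records, chunk=1000):
    """Insert records in transactional chunks via SQL batch scripts.

    `client` is assumed to be an opened pyorient.OrientDB connection.
    Each chunk is wrapped in begin/commit so the server applies it as
    one transaction instead of many independent writes; `retry 100`
    asks the server to retry the commit on conflicts.
    """
    for i in range(0, len(records), chunk):
        stmts = ["begin"]
        for props in records[i:i + chunk]:
            fields = ", ".join("%s = %r" % (k, v) for k, v in sorted(props.items()))
            stmts.append("create vertex V set " + fields)
        stmts.append("commit retry 100")
        client.batch(";".join(stmts))
```

The chunk size is the main knob: too small and you pay a round-trip per few records again; too large and each transaction becomes expensive to apply.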