groovy, gremlin

Gremlin - Memory/time efficient way to get all edge IDs from a graph


I'm trying to get all of the edge IDs from my graph into a text file without using too much memory/time overhead.

My first thought was to use lazy iteration: I create a traversal object, t = g.E().id(), and call t.next(x) in a while loop, where x is the batch size.

But it fails for a large number of edges, with the following error:

Error in /apps/external/4/.../get_edges.groovy at [24: }] - GC overhead limit exceeded

Note that it fails inside the while loop, since it does manage to write out a subset of the IDs first.

Here is the script I'm submitting to the gremlin console, which works for small graphs, but fails on my system for larger (millions of edges) graphs.

:remote connect tinkerpop.server conf/remote.yaml session
:remote console

chunkSize = 500
indexModToFile = 1000
idx = 0
edgesFileName = 'edges.txt'
statusFileName = 'status.txt'
new File(statusFileName).withWriter('utf-8') { def statusWriter ->
    new File(edgesFileName).withWriter('utf-8') { def edgeWriter ->
        t = g.E().id()
        def i
        // next(chunkSize) returns an empty list once the traversal is exhausted
        while (i = t.next(chunkSize)) {
            i.each { def e ->
                edgeWriter << e.toString() + '\n'
                idx += 1
            }
            // record progress every indexModToFile edges
            if (idx % indexModToFile == 0) {
                statusWriter << idx.toString() + '\n'
            }
        }
    }
}

Questions:

  • Why is this failing?
  • Is there a better and faster way to extract all of the edge IDs?

Edit 1

I've also tried export JAVA_OPTS="-Xms4G -Xmx6G", which still doesn't work, but I wouldn't have expected a lazy iterator to need it.


Solution

  • Why is this failing?

    I wonder whether you are running into memory problems, even with the increased Xmx, because the script does all of its work in a single transaction on the server. Try calling g.tx().rollback() after each batch completes and see whether that resolves the problem.

  • Is there a better and faster way to extract all of the edge IDs?

    If you have millions of edges, the most efficient way to do this is to use spark-gremlin, which is covered in the TinkerPop reference documentation. Short of that, I wouldn't bother with Gremlin Server at all: simply create a JanusGraph instance in Gremlin Console and execute the script locally.
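
The per-batch rollback suggested for the first question could be wired in roughly as follows. This is only a sketch: drainInChunks is a hypothetical helper, not part of Gremlin or JanusGraph, and it relies only on a Traversal being a java.util.Iterator. Whether an open traversal keeps working after g.tx().rollback() depends on the graph provider and is not verified here, so the rollback call appears only in a comment and a plain list stands in for g.E().id().

```groovy
// Hypothetical helper: writes one ID per line from any Iterator and calls
// afterChunk each time chunkSize IDs have been written (plus once for a
// trailing partial batch). Only one ID is held in memory at a time.
def drainInChunks = { Iterator ids, int chunkSize, Writer out, Closure afterChunk ->
    long written = 0
    while (ids.hasNext()) {
        out << ids.next().toString() << '\n'
        written++
        if (written % chunkSize == 0) {
            afterChunk(written)
        }
    }
    if (written % chunkSize != 0) {
        afterChunk(written)   // final partial batch
    }
    written
}

// Intended use against a real graph (assumes g is bound; not executed here):
// new File('edges.txt').withWriter('utf-8') { w ->
//     drainInChunks(g.E().id(), 500, w) { n -> g.tx().rollback() }
// }

// Local demonstration with a list standing in for the traversal.
def buffer = new StringWriter()
def batches = []
def total = drainInChunks([10, 11, 12, 13, 14].iterator(), 2, buffer) { n -> batches << n }
println "wrote ${total} ids in ${batches.size()} batches"
```

Passing the per-batch action as a closure keeps the streaming loop independent of the transaction handling, so the same loop can be tried with and without the rollback to see whether the transaction is what holds the memory.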