I'm trying to get all of the edge IDs from my graph into a text file without using too much memory/time overhead.
My first thought was to use lazy iteration: create a traversal object t = g.E().id() and call t.next(chunkSize) in a while loop.
But it fails for a large number of edges, with the following error:
Error in /apps/external/4/.../get_edges.groovy at [24: }] - GC overhead limit exceeded
Note that it fails inside the while loop, since it does manage to write out a subset of the IDs before failing.
Here is the script I'm submitting to the Gremlin Console. It works for small graphs, but fails on my system for larger graphs (millions of edges).
:remote connect tinkerpop.server conf/remote.yaml session
:remote console
chunkSize = 500
indexModToFile = 1000
idx = 0
edgesFileName = 'edges.txt'
statusFileName = 'status.txt'
new File(statusFileName).withWriter('utf-8') { statusWriter ->
    new File(edgesFileName).withWriter('utf-8') { edgeWriter ->
        t = g.E().id()
        def i
        // next(n) returns up to n results as a List; an empty List is
        // falsy in Groovy, which ends the loop
        while (i = t.next(chunkSize)) {
            i.each { e ->
                edgeWriter << e.toString() + '\n'
                idx += 1
            }
            // report progress after each chunk
            if (idx % indexModToFile == 0) {
                statusWriter << idx.toString() + '\n'
            }
        }
    }
}
Questions:

1. Why is this failing?
2. Is there a better and faster way to extract all of the edge IDs?

Edit 1: I've tried export JAVA_OPTS="-Xms4G -Xmx6G" as well (which still doesn't work), but I wouldn't have thought this would be necessary with a lazy iterator.
Why is this failing?
I wonder if you are running into memory problems even with the increased Xmx because the script you are executing is doing all of that work in a single transaction on the server. Perhaps you should try a g.tx().rollback() after each batch completes to see whether that resolves the problem.
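For example, a minimal sketch of that change to the inner loop, keeping the rest of your script the same:

while (i = t.next(chunkSize)) {
    i.each { e ->
        edgeWriter << e.toString() + '\n'
        idx += 1
    }
    // release the server-side transaction after each batch so transactional
    // state does not accumulate across the whole scan
    g.tx().rollback()
}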
Is there a better and faster way to extract all of the edge IDs?
If you have millions of edges, then the most efficient way to do what you are doing is to use spark-gremlin; the documentation for doing so can be found here. Short of that, I'd not bother with Gremlin Server and would simply create a JanusGraph instance in the Gremlin Console and execute the script locally.
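A minimal sketch of running it locally, assuming a properties file at conf/janusgraph-cql.properties (that file name is an assumption; point it at whatever configuration matches your storage backend):

// open the graph directly in the Gremlin Console; no :remote session needed
graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')  // assumed path
g = graph.traversal()

new File('edges.txt').withWriter('utf-8') { edgeWriter ->
    // the traversal is an Iterator, so each { } consumes results lazily
    g.E().id().each { id ->
        edgeWriter << id.toString() + '\n'
    }
}

graph.close()

Running locally avoids the per-request overhead and server-side result accumulation of a remote session entirely.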