I would like to iterate over all nodes in a graph with roughly 100 million nodes. I know I can get the nodes with the Cypher query
MATCH n RETURN n
but then I would have to use SKIP and LIMIT to page through the dataset, and I think there are performance issues with that approach.
My question is: how can I iterate over all nodes using an embedded Neo4j database? The whole thing will run as a background job (indexing the nodes into Elasticsearch).
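For reference, the SKIP/LIMIT paging I have in mind looks roughly like this (just a sketch, assuming a Neo4j version where GraphDatabaseService.execute is available; the batch size is arbitrary and database is an org.neo4j.graphdb.GraphDatabaseService):

int batchSize = 10000;
for (long skip = 0; ; skip += batchSize) {
    try (Transaction tx = database.beginTx()) {
        // each page re-reads and skips everything before it, so later pages get slower
        Result result = database.execute(
                "MATCH (n) RETURN n SKIP " + skip + " LIMIT " + batchSize);
        if (!result.hasNext()) {
            break;
        }
        while (result.hasNext()) {
            Node node = (Node) result.next().get("n");
            // index the node into Elasticsearch here
        }
        tx.success();
    }
}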
Thanks guys for mentioning GraphAware; just to throw another approach into the mix: the problem with getting all nodes via vanilla GlobalGraphOperations is that it all happens in a single transaction. On a graph with 100M nodes, this won't work.
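For context, the vanilla approach looks roughly like this (a sketch for Neo4j 2.x, where GlobalGraphOperations is the way to get all nodes); note that the entire iteration runs inside one transaction:

try (Transaction tx = database.beginTx()) {
    // org.neo4j.tooling.GlobalGraphOperations; every node visited here
    // belongs to this single transaction, which is what breaks down at 100M nodes
    for (Node node : GlobalGraphOperations.at(database).getAllNodes()) {
        // do something with each node
    }
    tx.success();
}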
For this reason, the GraphAware Framework has a number of BatchTransactionExecutors that we use in our modules for re-indexing, recovery, and similar scenarios where you need to do something for each node or relationship (or a subset of these).
Let me post an example of how you would use this - it's from GraphAware's Schema Enforcement Module (not open source, hence posting it here):
final List<String> violations = new LinkedList<>();

new IterableInputBatchTransactionExecutor<>(database, 1000,
        new AllNodes(database, 1000),
        new UnitOfWork<Node>() {
            @Override
            public void execute(GraphDatabaseService database, Node input, int batchNumber, int stepNumber) {
                // check the node against the module's constraints
                // (nodeConstraints is defined elsewhere in the module)
                for (Constraint<Node> constraint : nodeConstraints) {
                    if (!constraint.satisfiedBy(input)) {
                        violations.add(input + " violates " + constraint.toString());
                    }
                }
            }
        }).execute();
Most of the input parameters should be self-explanatory. Note that AllNodes is another framework class which fetches all nodes from the database in batches of 1000 (in this case) per transaction. We provide others (AllNodesWithLabel, AllRelationships), but you can easily implement your own.
Doing this in the background is then a matter of creating a separate thread, or, if you want to get more sophisticated, using the framework's timer-driven modules, as William already pointed out.
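A minimal sketch of the separate-thread option (plain JDK, nothing framework-specific; executor here stands for an IterableInputBatchTransactionExecutor built as above):

ExecutorService background = Executors.newSingleThreadExecutor();
background.submit(new Runnable() {
    @Override
    public void run() {
        // runs the batched iteration without blocking the calling thread
        executor.execute();
    }
});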