
embedded neo4j: iterate over all nodes of a huge graph


I would like to iterate over all nodes in a graph of roughly 100 million nodes. I know I can get the nodes with the Cypher query

    MATCH n RETURN n

but then I would have to page through the dataset with SKIP and LIMIT, and I suspect that approach performs poorly.

Now my question is: how can I iterate over all nodes using an embedded Neo4j database? The whole thing will run as a background job (indexing nodes into Elasticsearch).


Solution

  • Thanks, guys, for mentioning GraphAware. Just to throw another approach into the mix: the problem with getting all nodes with vanilla GlobalGraphOperations is that it all happens in a single transaction. On a graph with 100M nodes, this won't work.

    For this reason, GraphAware Framework has a number of BatchTransactionExecutors that we use in our modules for re-indexing, recovery, and similar scenarios where you need to do something for each node or relationship, or for a subset of them.

    Let me post an example of how you would use this - it's from GraphAware's Schema Enforcement Module (not open source, hence posting here):

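        // Collects one message per constraint violation found during the scan.
        final List<String> violations = new LinkedList<>();
    
        // Second argument (1000): number of units of work per transaction.
        new IterableInputBatchTransactionExecutor<>(database, 1000, 
                // Input: all nodes, fetched in batches of 1000 per transaction.
                new AllNodes(database, 1000),
                new UnitOfWork<Node>() {
                    @Override
                    public void execute(GraphDatabaseService database, Node input, int batchNumber, int stepNumber) {
                        // nodeConstraints is defined elsewhere in the module (not shown).
                        for (Constraint<Node> constraint : nodeConstraints) {
                            if (!constraint.satisfiedBy(input)) {
                                violations.add(input + " violates " + constraint.toString());
                            }
                        }
                    }
                }).execute();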
    

    Most of the input parameters should be self-explanatory. Note that AllNodes is another framework class which fetches all nodes from the database in batches of 1000 (in this case) per transaction. We provide others (AllNodesWithLabel, AllRelationships), but you can easily implement your own.

    Doing this in the background is then a matter of creating a separate thread or, if you want to get more sophisticated, using the framework's timer-driven modules, as William already pointed out.
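
To make the batching idea concrete without the framework, here is a minimal plain-Java sketch of "one transaction per batch, run on a separate thread". It is not GraphAware's or Neo4j's API: `Tx`, `processInBatches`, and the in-memory integer "nodes" are hypothetical stand-ins made up for illustration. In a real embedded setup the transaction would presumably come from `database.beginTx()`, and the input would have to be re-fetched per transaction, which is the job the answer attributes to `AllNodes`; the sketch skips that part and uses a plain iterator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class BatchedIteration {

    /** Hypothetical stand-in for a transaction (not a Neo4j type); "commits" on close. */
    public interface Tx extends AutoCloseable {
        @Override
        void close();
    }

    /**
     * Applies unitOfWork to every input item, opening a fresh Tx for each
     * group of at most batchSize items. Returns the number of batches run.
     */
    public static <T> int processInBatches(Iterator<T> input, int batchSize,
                                           Supplier<Tx> txFactory,
                                           Consumer<? super T> unitOfWork) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        while (input.hasNext()) {
            // One transaction per batch, so no single transaction grows unbounded.
            try (Tx tx = txFactory.get()) {
                int steps = 0;
                while (steps < batchSize && input.hasNext()) {
                    unitOfWork.accept(input.next());
                    steps++;
                }
            }
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) throws Exception {
        // Fake "nodes": the integers 0..2499 (assumption, stands in for real nodes).
        Iterator<Integer> fakeNodes = IntStream.range(0, 2500).boxed().iterator();
        List<Integer> indexed = Collections.synchronizedList(new ArrayList<>());
        AtomicInteger commits = new AtomicInteger();

        // Each Tx simply counts how often it was closed.
        Supplier<Tx> txFactory = () -> commits::incrementAndGet;
        Callable<Integer> job = () -> processInBatches(fakeNodes, 1000, txFactory, indexed::add);

        // Run the whole scan on a single background thread.
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> result = background.submit(job);
            System.out.println("batches=" + result.get()
                    + " commits=" + commits.get()
                    + " items=" + indexed.size());
        } finally {
            background.shutdown();
        }
    }
}
```

With 2500 items and a batch size of 1000, the loop opens three transactions (1000, 1000, and 500 items). The unit of work here only appends to a list; for the indexing job in the question it would be whatever sends a node to Elasticsearch.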