Search code examples
neo4jcypher

Cypher DELETE relationship is much slower than programmatic Relationship::delete


My data model is

(:Parent {parentId:...})-[:CONTAINS]->(:Child)-[:SUPPORTS]->(:Dummy)

Every Child has only one Parent. The Parent.parentId attribute is unique, i.e. there is a constraint defined as

CREATE CONSTRAINT Parent_parentId_unique IF NOT EXISTS ON (p:Parent) ASSERT p.parentId IS UNIQUE;

I have a user-defined @Procedure which accepts collection of parentIds and I want to remove all :SUPPORTS relationships from all their children.

  • When everything is in cypher, the execution of procedure is slow - hundreds milliseconds to seconds.
Result removeRelationshipResult = tx.execute(""
        + "MATCH (p:Parent)-[:CONTAINS]->(c:Child)-[r:SUPPORTS]->(:Dummy)\n"
        + "WHERE p.parentId IN $parentIds\n"
        + "DELETE r",
        Map.of("parentIds", parentIds)
);
RelationshipType CONTAINS = RelationshipType.withName("CONTAINS");
RelationshipType SUPPORTS = RelationshipType.withName("SUPPORTS");
for (Long parentId : parentIds) {
    Node parentNode = tx.findNode(Parent.LABEL, "parentId", parentId);
    streamOf(parentNode.getRelationships(Direction.OUTGOING, CONTAINS))
        .map(rel -> rel.getEndNode())
        .flatMap(childNode -> streamOf(childNode.getRelationships(SUPPORTS)))
        .forEach(Relationship::delete);
}

The difference happens even on first try when there is no :SUPPORTS relationship.

Where can be the cause of such difference and how to spot it?


UPDATE: Reaction to @cybersam's answer (too long and unformattable for comment):

I tested your suggestions on sample with 1737 Parent nodes and 655344 :SUPPORTS relations, splitted into 61 batches (useful for warmup from point #2 though primary purpose of split was different).

Applying point #1 caused huge performance improvement. The time is now comparable to programmatic implementation. I also tried to change programmatic implementation vice versa, i.e. add filtering to node labels but it did not have significant effect. Actually the time comparations differ for first run (when no relations exist) and second run (when relations from first run are actually deleted). Third and next run are similar to second run. The point #1 clearly answers my question and helped me a lot, thanks!

implementation first run: deletion time/total procedure time second run: deletion time/total procedure time [ms]
cypher-labels 79366/131261 170283/188783
cypher-nolabels 230/13756 1800/17284
program-labels 155/11731 2235/19539
program-nolabels 174/11805 2079/19111

Solution

  • There are at least 3 reasons for the speed difference.

    1. Your Cypher query specifies that the SUPPORT relationship's start node must be a Child and its end node must be a Dummy. Your Java code does not care what labels those 2 nodes have, and so has to do less work. This alternate MATCH clause would be more equivalent:

      MATCH (p:Parent)-[:CONTAINS]->()-[r:SUPPORTS]->()
      
    2. Cypher queries, when first executed (or executed again after being dropped from the Cypher query cache) must be parsed, error checked, and converted into Java operations that are optimized based on the current characteristics of the DB -- all of which takes time. Once the Cypher query's operations are cached, the same Cypher query will run faster. So, a fairer comparison would require that you run the Cypher query more than once and to ignore the timing of the first run.

    3. As is always the case with generated code, you cannot expect that the Java code generated from a Cypher query will always be as efficient as hand-crafted code that tries to do the same thing. Some additional overhead is normal.