My data model is
(:Parent {parentId:...})-[:CONTAINS]->(:Child)-[:SUPPORTS]->(:Dummy)
Every Child
has only one Parent
. The Parent.parentId
attribute is unique, i.e. there is a constraint defined as
CREATE CONSTRAINT Parent_parentId_unique IF NOT EXISTS ON (p:Parent) ASSERT p.parentId IS UNIQUE;
I have a user-defined @Procedure
which accepts collection of parentIds
and I want to remove all :SUPPORTS
relationships from all their children.
Result removeRelationshipResult = tx.execute(""
+ "MATCH (p:Parent)-[:CONTAINS]->(c:Child)-[r:SUPPORTS]->(:Dummy)\n"
+ "WHERE p.parentId IN $parentIds\n"
+ "DELETE r",
Map.of("parentIds", parentIds)
);
streamOf
is utility method to convert Iterable to Stream).RelationshipType CONTAINS = RelationshipType.withName("CONTAINS");
RelationshipType SUPPORTS = RelationshipType.withName("SUPPORTS");
for (Long parentId : parentIds) {
Node parentNode = tx.findNode(Parent.LABEL, "parentId", parentId);
streamOf(parentNode.getRelationships(Direction.OUTGOING, CONTAINS))
.map(rel -> rel.getEndNode())
.flatMap(childNode -> streamOf(childNode.getRelationships(SUPPORTS)))
.forEach(Relationship::delete);
}
The difference happens even on first try when there is no :SUPPORTS
relationship.
Where can be the cause of such difference and how to spot it?
UPDATE: Reaction to @cybersam's answer (too long and unformattable for comment):
I tested your suggestions on sample with 1737 Parent
nodes and 655344 :SUPPORTS
relations, splitted into 61 batches (useful for warmup from point #2 though primary purpose of split was different).
Applying point #1 caused huge performance improvement. The time is now comparable to programmatic implementation. I also tried to change programmatic implementation vice versa, i.e. add filtering to node labels but it did not have significant effect. Actually the time comparations differ for first run (when no relations exist) and second run (when relations from first run are actually deleted). Third and next run are similar to second run. The point #1 clearly answers my question and helped me a lot, thanks!
implementation | first run: deletion time/total procedure time | second run: deletion time/total procedure time [ms] |
---|---|---|
cypher-labels | 79366/131261 | 170283/188783 |
cypher-nolabels | 230/13756 | 1800/17284 |
program-labels | 155/11731 | 2235/19539 |
program-nolabels | 174/11805 | 2079/19111 |
There are at least 3 reasons for the speed difference.
Your Cypher query specifies that the SUPPORT
relationship's start node must be a Child
and its end node must be a Dummy
. Your Java code does not care what labels those 2 nodes have, and so has to do less work. This alternate MATCH
clause would be more equivalent:
MATCH (p:Parent)-[:CONTAINS]->()-[r:SUPPORTS]->()
Cypher queries, when first executed (or executed again after being dropped from the Cypher query cache) must be parsed, error checked, and converted into Java operations that are optimized based on the current characteristics of the DB -- all of which takes time. Once the Cypher query's operations are cached, the same Cypher query will run faster. So, a fairer comparison would require that you run the Cypher query more than once and to ignore the timing of the first run.
As is always the case with generated code, you cannot expect that the Java code generated from a Cypher query will always be as efficient as hand-crafted code that tries to do the same thing. Some additional overhead is normal.