Search code examples
performanceoptimizationrediscypherredisgraph

Cypher: slow query optimization


I am using redisgraph with a custom implementation of ioredis. The query runs 3 to 6 seconds on a database that has millions of nodes. It basically filters (b:brand) by different relationship counts by adding the following match and where multiple times on different nodes.

(:brand) - 1mil nodes
(:w) - 20mil nodes
(:e) - 10mil nodes
// matching b before this codeblock
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE  count >= 0   AND count <= 10

The full query would look like this.

MATCH (b:brand)
WHERE  b.deleted IS NULL
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE  count >= 0   AND count <= 10
MATCH (c)-[:r3]->(d:d)<-[:r4]-(e:e)
WHERE e.deleted IS NULL
WITH count(DISTINCT e) as count, b
WHERE  count >= 0   AND count <= 10
WITH b ORDER by b.name asc
WITH count(b) as totalCount, collect({id: b.id)[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount

How can I optimize this query as it's really slow?


Solution

  • A few thoughts:

    • Property lookups are expensive; is there a way you can get around all the .deleted checks?
    • If possible, can you avoid naming r1, r2, etc.? It's faster when it doesn't have to check the relationship type.
    • You're essentially traversing the entire graph several times. If the paths b-->p<--w and c-->d<--e don't overlap, you can include them both in the match statement, separated by a comma, and aggregate both counts at once
    • I don't know if it'll help much, but you don't need to name p and d since you never refer to them
    • This is a very small improvement, but I don't see a reason to check count >= 0

    Also, I'm sure you have your reasons, but why does the c-->d<--e path matter? This would make more sense to me if it were b-->d<--e to mirror the first portion.

    EDIT/UPDATE: A few things I said need clarification:

    First bullet: The fastest lookup is on a node label; up to 4 labels are essentially O(0). (Well, for anchor nodes, it's slower for downstream nodes.) The second-fastest lookup is on an INDEXED property. My comment above assumed UNINDEXED lookups.

    Second bullet: I think I was just wrong here. Relationships are stored as doubly-linked lists grouped by relationship type. Therefore, always specify relationship type for better performance. Similarly, always specify direction.

    Third bullet: What I said is generally correct, HOWEVER beware of Cartesian joins when you have two MATCH statements separated by a comma. In general, you would only use that structure when you have a common element, like you want directors, actors, and cinematographers all connected to a movie. Still, no overlap between these paths.