Search code examples
performanceneo4jcypherquery-optimization

Efficient query for intersections of shared relationships in Neo4J


We have a Neo4J (5.2.0) database containing two types of nodes Topic and Document. Topics and documents are linked by OCCURS_IN relationships (:Topic)-[:OCCURS_IN]->(:Document) and OCCURS_IN_RANDOM_SAMPLE relationships (:Topic)-[:OCCURS_IN_RANDOM_SAMPLE]->(:Document) where OCCURS_IN_RANDOM_SAMPLE is a precomputed subset of OCCURS_IN limited to 10000 per Topic/Document pair. Topics are indexed by keyphrases.

The goal of our queries is to compute similarities between Topics based on their relationships with the documents. Our initial query was of the form:

UNWIND $topics as topic
MATCH (left:Topic {keyphrase: $left})-[:OCCURS_IN_RANDOM_SAMPLE]            
    ->(document:Document)<-[:OCCURS_IN]
    -(right:Topic {keyphrase: topic})
WITH right.keyphrase as right, 
    left.support as lsupport, 
    right.support as rsupport, 
    toFloat(count(document)) as intersection
...

with the constraint that the support of left is lower than the support of every Topic in topics.

Our fastest iteration of this query is:

UNWIND $topics as topic
MATCH (right:Topic {keyphrase: topic})
WITH right
MATCH (left:Topic {keyphrase: $left})-[:OCCURS_IN_RANDOM_SAMPLE]            
    ->(document:Document)
WHERE EXISTS((document)<-[:OCCURS_IN]-(right))
WITH right.keyphrase as right, 
    left.support as lsupport, 
    right.support as rsupport, 
    toFloat(count(document)) as intersection
...

But even with it and the precomputed samples, it is still slow when comparing topics with supports of thousands of documents. The bottleneck is oviously to compute this intersection count. Is there a more efficient way to improve it? Subsidiary question: Does it normally make sense to use EXISTS instead of matching the whole pattern or do we face a weird edge case where it works?

Thanks in advance :)


Solution

  • Your query is causing an unnecessary cartesian product.

    Here is a simpler query that should be faster (you can use PROFILE to compare queries):

    MATCH (left:Topic)-[:OCCURS_IN_RANDOM_SAMPLE]->(doc)<-[:OCCURS_IN]-(right:Topic)
    WHERE left.keyphrase = $left AND right.keyphrase IN $topics
    WITH right.keyphrase AS right, 
        left.support AS lsupport, 
        right.support AS rsupport, 
        TOFLOAT(COUNT(doc)) AS intersection
    ...
    

    Also, while it makes sense to use EXISTS to avoid unnecessarily repeating subpath traversals, it seems unlikely that a given Topic node would have more than one OCCURS_IN relationship to a given Document node. Therefore, EXISTS would not be buying you anything.