
Preventing duplicate SIMILAR relationships when using algo.similarity.jaccard on continuously updated data

I am computing the Jaccard similarity index for a category of nodes in a graph using the algo.similarity.jaccard algorithm from the Neo4j Graph Algorithms library. After calculating the Jaccard similarity with a given cutoff, I store the metric in a relationship between the nodes (this is a feature of the algorithm). I want to track how the graph changes over time as I add new data (I will be reloading my CSV file with new data and merging in new nodes and relationships).

A problem I foresee is that once I run the Jaccard algorithm again with the updated graph, it will create duplicate relationships. This is the Neo4j documentation example of the code that I am using:

MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

Is there a way to specify that I do not want duplicate relationships each time I run this code against an updated graph? If I were writing the relationships by hand, I'd use MERGE instead of CREATE, but since this is an algorithm from a library, I'm not sure how to go about that. FYI, I will not be able to modify the library plugin, and it seems there is no way to store the relationship under a different type such as SIMILARITY2.


  • There are at least 2 ways to avoid duplicate relationships from multiple calls to algo.similarity.jaccard:

    1. Delete the existing relationships (by default, they have the SIMILAR type) before each call. This is probably the easiest approach.

    2. Omit the write:true option when making the calls (so that the procedure won't create relationships at all), and write your own Cypher code that creates a relationship only where one does not already exist (using MERGE).
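
    As a minimal sketch of the first approach, the cleanup step could look like the query below. It assumes the relationships were written with the default SIMILAR type between Person nodes, as in the question's example; adjust the labels and type if your graph differs. Run it immediately before each call to algo.similarity.jaccard with write:true.

    ```cypher
    // Remove every existing SIMILAR relationship between Person nodes.
    // The directed pattern visits each relationship exactly once.
    MATCH (:Person)-[s:SIMILAR]->(:Person)
    DELETE s
    ```

    Because the old relationships are gone before the procedure writes new ones, each run leaves exactly one set of SIMILAR relationships, reflecting the current data.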


    Here is an example of the second approach (using the algo.similarity.jaccard.stream variant of the procedure, which yields more useful values for our purposes):

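    // Build one {item, categories} map per person from the cuisines they like
    MATCH (p:Person)-[:LIKES]->(cuisine)
    WITH {item:id(p), categories: collect(id(cuisine))} as userData
    WITH collect(userData) as data
    // Stream the similarity pairs instead of letting the procedure write them
    CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.1})
    YIELD item1, item2, similarity
    // Keep only one orientation of each pair
    WHERE item1 < item2
    WITH algo.getNodeById(item1) AS n1, algo.getNodeById(item2) AS n2, similarity
    // Reuse an existing SIMILAR relationship if there is one, else create it
    MERGE (n1)-[s:SIMILAR]-(n2)
    SET s.score = similarity
    RETURN *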

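    The WHERE clause keeps only one orientation of each pair: the procedure can return the same node pair twice (once in each direction, with the same similarity score), and skipping the repeat speeds up processing. If, with topK, a pair happens to be reported in only one direction, the filter could discard it; removing the WHERE clause is safe in that case, since the MERGE still prevents duplicates. The algo.getNodeById() utility function gets a node by its native ID. The MERGE clause's relationship pattern does not specify a value for score, so it matches an existing relationship even if the stored score differs from the new one, and because the pattern is undirected it matches whichever way the relationship points. The SET clause is placed after the MERGE so that the score is overwritten on every run and stays up to date.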
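
    To check that repeated runs really leave at most one relationship per pair, a sanity-check query along these lines may help. It is only a sketch, assuming the Person label and SIMILAR type used above; relCount is just an illustrative alias.

    ```cypher
    // Count SIMILAR relationships per unordered pair of Person nodes.
    // The id() comparison keeps one orientation of each undirected match.
    MATCH (n1:Person)-[s:SIMILAR]-(n2:Person)
    WHERE id(n1) < id(n2)
    WITH n1, n2, count(s) AS relCount
    // Any pair with more than one relationship is a duplicate
    WHERE relCount > 1
    RETURN id(n1) AS item1, id(n2) AS item2, relCount
    ```

    An empty result means no pair of nodes is connected by more than one SIMILAR relationship.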