neo4j cypher py2neo

Jaccard similarity: how to create the "SIMILAR" relationship using writeRelationshipType


I am trying to suggest keywords based on a Jaccard Similarity cutoff point. The end goal would be to use py2neo and call this query whenever the user wants recommended keywords. My rationale is:

(Title1)-[:HAS_KEYWORDS]->(Keyword1)<-[:HAS_KEYWORDS]-(Title2)-[:HAS_KEYWORDS]->(Keyword2)

I was following the example found in the handbook:
https://neo4j.com/docs/graph-algorithms/current/algorithms/similarity-jaccard/
A representation of my test data CSV files is as follows. CSV used to create all title nodes:

title_id,title  
T1,Article Title 1  
T2,Article Title 2 

CSV that I want to use to create the relationships:

title_id,keyword_id,keyword  
T1,K1,aaa  
T1,K2,bbb  
T1,K3,ccc  
T1,K4,ddd  
T2,K1,aaa  
T2,K5,eee  
T2,K6,fff  
T2,K4,ddd  
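
As a sanity check on the sample rows above, the Jaccard score can be worked out by hand: T1 and T2 share K1 and K4 out of six distinct keywords, so the score should be 2/6. A minimal Python sketch of that arithmetic follows; the helper names `keyword_sets` and `jaccard` are illustrative only and are not part of the Neo4j procedures.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# The relationship rows from the sample CSV above.
ROWS = """title_id,keyword_id,keyword
T1,K1,aaa
T1,K2,bbb
T1,K3,ccc
T1,K4,ddd
T2,K1,aaa
T2,K5,eee
T2,K6,fff
T2,K4,ddd
"""


def keyword_sets(csv_text):
    """Group keyword ids by title id (the same item/categories shape the procedure expects)."""
    sets = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sets[row["title_id"].strip()].add(row["keyword_id"].strip())
    return dict(sets)


def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sets = keyword_sets(ROWS)
for left, right in combinations(sorted(sets), 2):
    # Prints: T1 T2 0.3333
    print(left, right, round(jaccard(sets[left], sets[right]), 4))
```

If the value reported by the procedure for this pair differs from roughly 0.33, the input passed as `data` is probably not one keyword list per title.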

I am currently at the point where I calculate the similarity:

I have tried the following:

MATCH (search_query:Title)-[:HAS_KEYWORDS]->(k_id:Keyword)
<-[:HAS_KEYWORDS]-(return_query:Title)-[r2:HAS_KEYWORDS]->(rec_k:Keyword)  
WITH {item:id(return_query), categories: collect(id(rec_k))} as userData  
WITH collect(userData) as data  
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.0})  
YIELD item1, item2, count1, count2, intersection, similarity  
RETURN algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to,  intersection, similarity ORDER BY similarity DESC  

However, as I continue through the example, the example uses another query, which I have also tried to replicate:

MATCH (search_query:Title)
  -[:HAS_KEYWORDS]->(k_id:Keyword)
 <-[:HAS_KEYWORDS]-(return_query:Title)
  -[r2:HAS_KEYWORDS]->(rec_k:Keyword)     
WITH {item:id(return_query), categories: collect(id(rec_k))} as userData 
WITH collect(userData) as data  
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.0, write:true})  
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100  
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95  

I am attempting to move on to the next step and query for the SIMILAR relationship,
but when I check my results, I find that the SIMILAR relationship has not been created in my test graph. Thus, my first question is: Q: Why doesn't the SIMILAR relationship appear in my graph?
(A related sub-question: I believe my MATCH logic searches for all titles that share at least one keyword with another title, where the other title must also have at least one other, unrelated keyword. If I use the second example, would I only be able to create one SIMILAR relationship?)

My second question would be related to my end goal. Q: If I understand the query correctly, only the most similar result will have the SIMILAR relationship stored in the database; would I be able to use the same query inside the function? Currently my function looks something like this:

def get_similar_keywords(self):
    # Assign the Cypher text to `query`; as a bare docstring it was never passed to graph.run().
    query = '''
    MATCH (search_query:Title)
          -[:HAS_KEYWORDS]->(k_id:Keyword)
         <-[:HAS_KEYWORDS]-(return_query:Title)
          -[r2:HAS_KEYWORDS]->(rec_k:Keyword)
    WITH {item:id(return_query), categories: collect(id(rec_k))} as userData
    WITH collect(userData) as data
    CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.0, write:true})
    YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
    RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
    '''
    # `graph` is assumed to be a py2neo Graph created elsewhere in the module.
    # `username` is passed as a parameter, although the query does not reference $username yet.
    return graph.run(query, username=self.username)

Right now, my goals are to find out: 1. Whether my idea behind the MATCH conditions is wrong; 2. How to create the SIMILAR relationship using writeRelationshipType; and 3. Whether these queries can be reused.

Currently, after playing with the variables, I think I have Jaccard similarity values that look correct:

╒═══════╤═════════════════╤═══════╤═══════════════════════╤═══════════════╤═══════════════════╤══════════════════╤══════════════════╤══════════════════╕
│"nodes"│"similarityPairs"│"write"│"writeRelationshipType"│"writeProperty"│"min"              │"max"             │"mean"            │"p95"             │
╞═══════╪═════════════════╪═══════╪═══════════════════════╪═══════════════╪═══════════════════╪══════════════════╪══════════════════╪══════════════════╡
│7      │5                │false  │"SIMILAR"              │"score"        │0.01162785291671753│0.5844191908836365│0.2831512808799744│0.5844191908836365│
└───────┴─────────────────┴───────┴───────────────────────┴───────────────┴───────────────────┴──────────────────┴──────────────────┴──────────────────┘
I just don't quite get why it shows it has "SIMILAR" but nothing shows up on the graph...

If I am on the right track, I would like to replicate this code:

MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
      (other)-[:LIKES]->(cuisine)  
WHERE not((p)-[:LIKES]->(cuisine))  
RETURN cuisine.name AS cuisine  

... and return recommended keywords through py2neo.

Thank you very much,

Eric


Solution

  • To write back the SIMILAR relationship, you have to use similarityCutoff: 0.1 or higher. Check the source code to see why.

    Also, your MATCH query is a little off, so the write back query should look like:

    MATCH (search_query:Title)-[:HAS_KEYWORDS]->(k_id:Keyword)
    
    WITH {item:id(search_query), categories: collect(id(k_id))} as userData
    WITH collect(userData) as data
    CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
    YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
    RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
    

    You pass the id of the Title as item and the ids of all keywords describing that title as categories, and let the algorithm handle the rest.

    Now that the relationships are stored, you can run the recommendation query.

    MATCH (p:Title {name: "T1"})-[:SIMILAR]->(other),
          (other)-[:HAS_KEYWORDS]->(keyword)  
    WHERE not((p)-[:HAS_KEYWORDS]->(keyword))  
    RETURN keyword.name AS keywords
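
To call the recommendation query from py2neo, one option is a small wrapper that passes the title as a query parameter and unpacks the result rows. The sketch below is only an illustration: the function name `recommend_keywords` and the constant `RECOMMEND_QUERY` are hypothetical, the `name` properties follow the query above and may need to be changed to whatever properties were actually loaded from the CSV files (for example `title_id`), and `graph` is assumed to be an already connected py2neo `Graph`.

```python
# Assumption: property names (`name`) mirror the answer's query; adjust them
# to match the properties created when the CSV files were loaded.
RECOMMEND_QUERY = """
MATCH (p:Title {name: $title})-[:SIMILAR]->(other),
      (other)-[:HAS_KEYWORDS]->(keyword)
WHERE NOT (p)-[:HAS_KEYWORDS]->(keyword)
RETURN keyword.name AS keyword
"""


def recommend_keywords(graph, title):
    """Return keywords used by the title's SIMILAR neighbour but not by the title itself.

    `graph` is expected to behave like a py2neo Graph: run() accepts the query
    plus keyword parameters and returns a cursor whose data() is a list of dicts.
    """
    cursor = graph.run(RECOMMEND_QUERY, title=title)
    return [record["keyword"] for record in cursor.data()]
```

Given a connected `Graph` instance, `recommend_keywords(graph, "T1")` would return a plain list of keyword strings. Passing the title as `$title` rather than formatting it into the query string lets the same query text be reused for every user request.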