neo4j cypher graph-databases collaborative-filtering

Collaborative filtering cypher with attributes in neo4j


I am using Neo4j to set up a recommender system. I have the following setup:

Nodes:

  • Users
  • Movies
  • Movie attributes (e.g. genre)

Relationships

  • (m:Movie)-[w:WEIGHT {weight: 10}]->(a:Attribute)
  • (u:User)-[r:RATED {rating: 5}]->(m:Movie)

Here is a diagram of how it looks:

[image not available]

I am now trying to figure out how to apply a collaborative filtering scheme that works as follows:

  1. Check which attributes the user has liked (implicitly, by liking the movies)
  2. Find other users who have liked similar attributes
  3. Recommend the top movies that the user has NOT seen but similar users have seen.

The condition, of course, is that each attribute has a certain weight for each movie. E.g. the genre adventure can have a weight of 10 for The Lord of the Rings but a weight of 5 for Titanic.

In addition, the system needs to take into account the rating for each movie. E.g. if another user has rated The Lord of the Rings 5, then the attributes of The Lord of the Rings are scaled for that user by 5 and not 10. A user who has also rated those implicit attributes close to 5 should then get this movie recommended, as opposed to a user who has rated similar attributes higher.

I made a start by simply recommending movies that other users have rated, but I am not sure how to take the RATED and WEIGHT relationships into account. It also did not work:

MATCH (user:User)-[:RATED]->(movie1)<-[:RATED]-(ouser:User),
         (ouser)-[:RATED]->(movie2)<-[:RATED]-(oouser:User)
WHERE user.uid = "user4"
AND   NOT    (user)-[:RATED]->(movie2)
RETURN oouser

Solution

  • What you are looking for, mathematically speaking, is a simplified Jaccard index between two users. That is, how similar are they based on how many things they have in common. I say simplified because we are not taking into account the movies they disagree about. Essentially, and following your order, it would be:

    1) Get the total weight of every Attribute for every user. For instance:

    // Sum rating * weight per attribute for the target user (uid as in the question's query)
    MATCH (user:User {uid: 'user4'})
    OPTIONAL MATCH (user)-[r:RATED]->(m:Movie)-[w:WEIGHT]->(a:Attribute)
    WITH user, r.rating * w.weight AS totalWeight, a
    WITH user, a, sum(totalWeight) AS totalWeight
    

    We need the last line because the previous one yields a row for each Movie-Attribute combination; the sum collapses them into one row per attribute.
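
To make the arithmetic of step 1 concrete, here is a minimal Python sketch of the same aggregation on plain dictionaries. The sample ratings and weights and the function name `attribute_totals` are made up for illustration; they are not part of the question's dataset.

```python
from collections import defaultdict

# Hypothetical sample data: (user, movie) -> rating, (movie, attribute) -> weight.
ratings = {("user1", "LOTR"): 5, ("user1", "Titanic"): 2}
weights = {
    ("LOTR", "adventure"): 10,
    ("LOTR", "fantasy"): 8,
    ("Titanic", "adventure"): 5,
    ("Titanic", "romance"): 10,
}


def attribute_totals(user, ratings, weights):
    """Sum rating * weight per attribute over every movie the user rated."""
    totals = defaultdict(int)
    for (rater, movie), rating in ratings.items():
        if rater != user:
            continue
        for (weighted_movie, attribute), weight in weights.items():
            if weighted_movie == movie:
                # One term per Movie-Attribute combination, like one row in Cypher.
                totals[attribute] += rating * weight
    return dict(totals)


totals = attribute_totals("user1", ratings, weights)
print(totals)
```

Here adventure gets 5 * 10 from one movie plus 2 * 5 from the other, which is the collapsing that `sum(totalWeight)` does in the query.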

    2) Then, we get users with similar tastes. This is a performance danger zone, so some filtering might be necessary. Brute-forcing it, we get the users whose rating * weight for each attribute falls within a 10% margin (for instance) of the user's total:

    WITH user, a, totalWeight * 0.9 AS minimum, totalWeight * 1.1 AS maximum
    MATCH (a)<-[w:WEIGHT]-(m:Movie)<-[r:RATED]-(otherUser:User)
    // Filter right after MATCH, while w, r, minimum and maximum are still in scope
    WHERE otherUser <> user
      AND w.weight * r.rating > minimum AND w.weight * r.rating < maximum
    WITH DISTINCT user, otherUser
    

    So now we have one row (unique because of the last line) for each otherUser that is a match. To be honest, I would need to test whether otherUsers with only one matching attribute are included; if they are, an additional filter would be needed. But I think that can wait until the basic query is working.
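
One possible shape for that additional filter, sketched in Python rather than Cypher: count how many distinct attributes each other user matches, then keep only users above a threshold. The data, the function name `match_counts_for` and the threshold of 2 are assumptions for illustration only.

```python
# Hypothetical sample data; user_totals is what step 1 would give for "user1".
user_totals = {"adventure": 60, "fantasy": 40, "romance": 20}
ratings = {
    ("user1", "LOTR"): 5,
    ("user1", "Titanic"): 2,
    ("user2", "LOTR"): 5,
    ("user2", "Titanic"): 2,
    ("user3", "Titanic"): 4,
    ("user4", "LOTR"): 5,
}
weights = {
    ("LOTR", "adventure"): 10,
    ("LOTR", "fantasy"): 8,
    ("Titanic", "adventure"): 5,
    ("Titanic", "romance"): 10,
}


def match_counts_for(user, user_totals, ratings, weights, tolerance=0.10):
    """Count, per other user, the attributes whose rating * weight lies
    strictly within the tolerance band around the target user's total."""
    matched = {}
    for (other, movie), rating in ratings.items():
        if other == user:
            continue
        for (weighted_movie, attribute), weight in weights.items():
            if weighted_movie != movie or attribute not in user_totals:
                continue
            total = user_totals[attribute]
            product = rating * weight
            if total * (1 - tolerance) < product < total * (1 + tolerance):
                matched.setdefault(other, set()).add(attribute)
    return {other: len(attributes) for other, attributes in matched.items()}


match_counts = match_counts_for("user1", user_totals, ratings, weights)
# Assumed threshold: require at least two matching attributes.
similar = sorted(u for u, n in match_counts.items() if n >= 2)
print(match_counts, similar)
```

In this sample, a user who matches on a single attribute does appear in the counts, so a threshold like the one above is what would drop them.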

    3) Now it's easy:

    MATCH (otherUser)-[r:RATED]->(m:Movie)
    WHERE NOT (user)-[:RATED]->(m)
    RETURN m, sum(r.rating) AS totalRating ORDER BY totalRating DESC
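
The ranking in step 3 can be sketched the same way in Python: drop the movies the user has already rated, sum the similar users' ratings per remaining movie, and sort in descending order. The sample data and the name `recommend` are hypothetical.

```python
# Hypothetical sample data: (user, movie) -> rating.
ratings = {
    ("user1", "LOTR"): 5,
    ("user2", "LOTR"): 5,
    ("user2", "Hobbit"): 4,
    ("user3", "Hobbit"): 3,
    ("user3", "Alien"): 5,
    ("user4", "Alien"): 1,
}
# Assumed output of step 2 for "user1".
similar_users = {"user2", "user3"}


def recommend(user, similar_users, ratings):
    """Rank movies unseen by `user` by the summed ratings of similar users."""
    seen = {movie for (rater, movie) in ratings if rater == user}
    scores = {}
    for (rater, movie), rating in ratings.items():
        if rater in similar_users and movie not in seen:
            scores[movie] = scores.get(movie, 0) + rating
    # Highest total rating first, like ORDER BY totalRating DESC.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


recommendations = recommend("user1", similar_users, ratings)
print(recommendations)
```

Summing ratings favours movies that many similar users have rated; averaging instead would be a different design choice with different results.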
    

    As mentioned before, the tricky part is 2), but once we know how to get the math right, it should be easier. Speaking of math: for this to work properly, the attribute weights of each movie should sum to 1 (normalization). Otherwise, movies with a larger total weight would count for more than others, which makes the comparison unfair.
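
A minimal Python sketch of that normalization, assuming every movie has a positive total weight. The sample weights and the name `normalize_weights` are made up for illustration.

```python
# Hypothetical sample data: (movie, attribute) -> raw weight.
weights = {
    ("LOTR", "adventure"): 10,
    ("LOTR", "fantasy"): 10,
    ("Titanic", "adventure"): 5,
    ("Titanic", "romance"): 15,
}


def normalize_weights(weights):
    """Rescale each movie's attribute weights so they sum to 1.

    Assumes each movie's raw weights add up to a positive number.
    """
    movie_totals = {}
    for (movie, _attribute), weight in weights.items():
        movie_totals[movie] = movie_totals.get(movie, 0) + weight
    return {
        (movie, attribute): weight / movie_totals[movie]
        for (movie, attribute), weight in weights.items()
    }


normalized = normalize_weights(weights)
print(normalized)
```

In the graph itself, the normalized value would be stored on (or computed from) the WEIGHT relationships before running the queries above.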

    I wrote this without working it through properly (paper, pencil, equations, statistics) and without trying the code on a sample dataset. I hope it helps anyway!

    If you want this recommendation without taking user ratings or attribute weights into account, it should be enough to replace the product in 1) and 2) with just w.weight or just r.rating, respectively. The RATED and WEIGHT relationships would still be traversed, so for instance an avid consumer of Adventure movies would still be recommended movies by other consumers of Adventure movies, but the result would no longer be modified by ratings or by attribute weight, whichever we chose to drop.

    EDIT: Code edited to fix syntax errors discussed in comments.