Search code examples
pythonperformancesetsimilarity

Efficient way to to find Objects with common attributes in 2 python-lists


The heading might be confusing - i will try to explain what I want to do: I'm studying computer-science and trying to implement a small movie-recommender as a project for my lecture "Data-Warehousing & Data Mining". Right now I'm trying to compute the similarity of 2 users according to their movie-ratings.

class Rating(Model): 
    def __init__(self, userID, movieID, rating):
          ...

I overrode __eq__, __ne__ and __hash__ of Rating but only considering the movieID to make it possible to create a set of Ratings of 2 Users to find movies they both have rated.

def similarity(userA, userB):
    ratingsA = userA.ratings
    ratingsB = userB.ratings
    common_ratings = set((ratingsA, ratingsB))

What i want now is something like the following: 2 Lists sorted in the same order to make it possible to calculate the cosin-distance of the users/their ratings.

[Rating(userID=1, movieID=4, rating=4.7), Rating(user=1, movie=7, rating=9.8)]
[Rating(userID=2, movieID=4, rating=2.0), Rating(user=2, movie=7, rating=6.6)]    

I really don't like my approach, but I couldn't find a better one the last couple hours.

Another, less efficient way (I think?) would be like this:

lA = []
lB = []
for rA in ratingsA:
    for rB in ratingsB: 
        if rA.movieID == rB.movieID:
            lA.append(rA)
            lB.append(rB)
sim = cos_dist(lA, lB)

This approach would probably work but i guess the runtime would be horrible, since there are around 40000 movies and the probability that 2 users have rated same movies is pretty low...

Maybe someone has an efficient approach? Thank you in advance!


Solution

  • Your approach is O(N^2) worst case. You can reduce the complexity to O(N log N) sorting the ratings lists:

    sorted_ratingsA = sorted(ratingsA, lambda x: x.movieID)
    sorted_ratingsB = sorted(ratingsB, lambda x: x.movieID)
    

    And now we can pop the items from these lists from the last one (for efficiency reasons) and use the order over the movieID to check whether a certain id was rated by the user. Something along the lines of:

    lA = []
    lB = []
    maxA = sorted_ratingsA.pop()
    maxB = sorted_ratingsB.pop()
    while sorted_ratingsA and sorted_ratingsB:
        if maxA.movieID == maxB.movieID:
            lA.append(maxA)
            lb.append(maxB)
            # instead of the following two pop calls you could simply
            # change the elif into a new if statement.
            maxA = sorted_ratingsA.pop()
            maxB = sorted_ratingsB.pop()
        elif maxA < maxB:
            maxB = sorted_ratingsB.pop()
        else:
            maxA = sorted_ratingsA.pop()
    

    As you can see the list that contain the maximum value is popped until either an equal id is found or until the id goes below, in that case you start popping from the other list. The fact the the lists are in ascending order means you find all matches in O(N log N).

    It is essential to use pop() because popping the end of a list takes amortized O(1) time, while using something like pop(0) would cost O(N) for each pop on average and would reintroduce an O(N^2) factor.


    An alternative approach is to simply use hashing, and this should get you on average time O(N). You first create two maps from movieID to ratings, and then intersect the maps:

    mapA = {x.movieId: x for x in ratingsA}
    mapB = {x.movieId: x for x in ratingsB}
    common_keys = mapA.keys() & mapB.keys()
    
    lA = [mapA[k] for k in common_keys]
    lB = [mapB[k] for k in common_keys]
    

    If you are using python<3.x replace keys() with viewkeys().

    Note: even if this solution uses hashing, the order of lA and lB matches because iteration order over a set changes only if the set is modified, so the two iterations above retrieve the corresponding ratings. However the order of the ratings itself is not defined (so you don't know the order in which the movieID will appear, but you know that it will match between lA and lB).


    You didn't mention SQL in your question, in any case if these objects are in an SQL database it's better to just let the database do the search for you. You probably have a rankings table with various fields and you want to do:

    SELECT * FROM rankings
    JOIN rankings AS rankings2
    ON rankings.movieID = rankings2.movieID