python similarity euclidean-distance

Euclidean distance between posts based on tags


I am playing with the Euclidean distance example from the book Programming Collective Intelligence:


# Returns a distance-based similarity score for person1 and person2 
def sim_distance(prefs,person1,person2): 
  # Get the list of shared_items 
  si={} 
  for item in prefs[person1]: 
    if item in prefs[person2]: 
       si[item]=1 
  # if they have no ratings in common, return 0 
  if len(si)==0: return 0 
  # Add up the squares of all the differences 
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]]) 

This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on tags, so I build a map such as:

url1 -> tag1 tag2
url2 -> tag1 tag3

But if I apply this to the function,

pow(prefs[person1][item]-prefs[person2][item],2) 

this becomes 0 cause tags don't have weight same tags has ranking 1. To test, I modified the code to force a difference manually:

pow(prefs[1,2) 

Then I got a lot of 0.5 similarity scores, but the similarity of a post to itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation.


Solution

  • Okay, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:

    def sim_distance(prefs, person1, person2): 
      # Get the list of shared_items
      p1, p2 = prefs[person1], prefs[person2]
      si = set(p1).intersection(set(p2))
    
      # Add up the squares of all the differences 
      matches = (p1[item] - p2[item] for item in si)
      return sum(a * a for a in matches) 
    

    Next, your post needs a bit of editing for clarity. I don't know what this means: "this becomes 0 cause tags don't have weight same tags has ranking 1."

    Lastly, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could tell us what you are getting and what you expect to get.
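
For illustration only, here is a guess at what such sample data might look like, built from the url1/url2 map in the question. The `prefs` contents and both helper names are hypothetical. The sketch suggests why the sum of squares over shared tags can never be anything but 0 when every tag has weight 1, and what changes if a missing tag is treated as weight 0 instead.

```python
# Hypothetical sample data: each post maps its tags to a constant weight of 1.
prefs = {
    "url1": {"tag1": 1, "tag2": 1},
    "url2": {"tag1": 1, "tag3": 1},
}

def shared_sum_of_squares(prefs, a, b):
    """Sum of squared differences over the tags both posts have."""
    p1, p2 = prefs[a], prefs[b]
    shared = set(p1) & set(p2)
    return sum((p1[t] - p2[t]) ** 2 for t in shared)

def union_sum_of_squares(prefs, a, b):
    """Same sum over every tag on either post; a missing tag counts as 0."""
    p1, p2 = prefs[a], prefs[b]
    all_tags = set(p1) | set(p2)
    return sum((p1.get(t, 0) - p2.get(t, 0)) ** 2 for t in all_tags)

print(shared_sum_of_squares(prefs, "url1", "url2"))  # 0
print(union_sum_of_squares(prefs, "url1", "url2"))   # 2
```

Over the union of tags, the squared distance is just the number of tags that appear on one post but not the other, so with unweighted tags the problem reduces to comparing sets.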

    Edit: based on my comment below, I would use code like this:

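    def sim_distance(prefs, person1, person2):
        # Jaccard similarity of the two tag sets: shared tags / all tags
        s, t = set(prefs[person1]), set(prefs[person2])
        union = s.union(t)
        if not union:
            return 0.0  # neither post has any tags
        # float() keeps the division from truncating to 0 or 1 under Python 2
        return float(len(s.intersection(t))) / len(union)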
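
As a rough usage sketch, a set-overlap score like the one above could be used to rank posts by similarity. The helper names `jaccard` and `most_similar`, and the third post `url3`, are made up for this example; the tag lists for url1 and url2 follow the map in the question.

```python
def jaccard(s, t):
    """Shared elements divided by all elements; 0.0 if both sets are empty."""
    union = s | t
    return float(len(s & t)) / len(union) if union else 0.0

def most_similar(prefs, post, n=3):
    """Rank every other post by tag overlap with `post`, best match first."""
    tags = set(prefs[post])
    scores = [(jaccard(tags, set(prefs[other])), other)
              for other in prefs if other != post]
    return sorted(scores, reverse=True)[:n]

# Hypothetical data: url3 is invented to give the ranking something to sort.
prefs = {
    "url1": ["tag1", "tag2"],
    "url2": ["tag1", "tag3"],
    "url3": ["tag1", "tag2", "tag4"],
}

print(most_similar(prefs, "url1"))
```

With this data, url3 should rank first (2 of 3 tags shared with url1) and url2 second (1 of 3). A post compared with itself scores 1.0, which addresses the self-similarity problem described in the question.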