python, nlp, similarity, nltk, tf-idf

Cosine Similarity of Vectors of different lengths?


I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying:

#len(u)==201, len(v)==246

cosine_distance(u, v)
ValueError: objects are not aligned

#this works though:
cosine_distance(u[:200], v[:200])
>> 0.52230249969265641

Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with vectors of different lengths.

I'm using this function:

import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    # u and v must have the same length, otherwise numpy.dot raises ValueError.
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

Also -- is the order of the tf_idf values in the vectors important? Should they be sorted -- or is it of no importance for this calculation?


Solution

  • Are you computing the cosine similarity of term vectors? Term vectors should all be the same length, with one entry per term in the vocabulary. If a word isn't present in a document, that document's vector should have a value of 0 for that term.

    I'm not exactly sure which vectors you're applying cosine similarity to, but for cosine similarity the vectors must always be the same length, and order very much does matter: a given position has to refer to the same term in both vectors.

    Example:

    Term | Doc1 | Doc2
    Foo  |  .3  |  .7
    Bar  |  0   |  8
    Baz  |  1   |  1
    

    Here you have two vectors, (.3,0,1) and (.7,8,1), and you can compute the cosine similarity between them. If you compared (.3,1) and (.7,8) instead, you'd be comparing the Doc1 score of Baz against the Doc2 score of Bar, which wouldn't make sense.
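
As a rough sketch of the idea above, and assuming each document's tf-idf scores are stored as a {term: weight} dict (the question doesn't say how they are stored), you could map both documents onto one shared vocabulary before comparing them. The names `to_aligned_vectors` and `cosine_similarity` are made up for this illustration, and plain Python is used instead of numpy only to keep the snippet self-contained:

```python
import math

def to_aligned_vectors(weights_a, weights_b):
    """Map two {term: weight} dicts onto one shared, sorted vocabulary."""
    vocabulary = sorted(set(weights_a) | set(weights_b))
    # A term that is missing from a document gets a weight of 0.0.
    u = [weights_a.get(term, 0.0) for term in vocabulary]
    v = [weights_b.get(term, 0.0) for term in vocabulary]
    return vocabulary, u, v

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as "not similar"
    return dot / (norm_u * norm_v)

# The Doc1/Doc2 scores from the table above; Bar is absent from Doc1.
doc1 = {"Foo": 0.3, "Baz": 1.0}
doc2 = {"Foo": 0.7, "Bar": 8.0, "Baz": 1.0}

vocabulary, u, v = to_aligned_vectors(doc1, doc2)
similarity = cosine_similarity(u, v)

print(vocabulary)  # ['Bar', 'Baz', 'Foo']
print(u)           # [0.0, 1.0, 0.3]
print(v)           # [8.0, 1.0, 0.7]
print(similarity)
```

Which order the vocabulary is in doesn't change the result, as long as both vectors use the same one. Sorting is just a convenient way to guarantee that here.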