Tags: python, scikit-learn, nlp, data-science

Does Euclidean Distance change when strings "double"?


Problem

I'm trying to answer this question: Consider two documents A and B whose Euclidean distance is d and cosine similarity is c (using no normalization other than raw term frequencies). If we create a new document A' by appending A to itself and another document B' by appending B to itself, then:

a. What is the Euclidean distance between A' and B' (using raw term frequency)?

My solution

doc1 = "the quicker brown dogs easily jumps over the lazy dogs" 
doc2 = "the quicker dogs pose a serious problem for lazy dogs"

def calc_term_frequency(doc: str) -> dict:
    # Count raw occurrences of each word.
    words = doc.split()
    dic = {}
    for word in words:
        dic[word] = dic.get(word, 0) + 1

    # Normalize each count by the document length.
    for word, frequency in dic.items():
        dic[word] = frequency / len(words)

    return dic

tfs_doc1 = calc_term_frequency(doc1)
tfs_doc2 = calc_term_frequency(doc2)
print(tfs_doc1)

Outputs tfs_doc1 as: {'the': 0.2, 'quicker': 0.1, 'brown': 0.1, 'dogs': 0.2, 'easily': 0.1, 'jumps': 0.1, 'over': 0.1, 'lazy': 0.1}. This seems to work as it should. I then proceed to calculate the Euclidean distance, first between doc1 and doc1 and then between doc1 and doc2, as shown below.

import math
math.sqrt(sum((tfs_doc1.get(k, 0) - tfs_doc1.get(k, 0))**2 for k in set(tfs_doc1.keys()).union(set(tfs_doc1.keys())))) # output: 0
math.sqrt(sum((tfs_doc1.get(k, 0) - tfs_doc2.get(k, 0))**2 for k in set(tfs_doc1.keys()).union(set(tfs_doc2.keys())))) # output: 0.316227766016838
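The same computation can be wrapped in a reusable helper; a minimal sketch (euclidean_distance is a hypothetical name, not from the original code):

import math

def euclidean_distance(tf_a: dict, tf_b: dict) -> float:
    # Iterate over the union of both vocabularies; a term missing
    # from one document contributes a frequency of 0 there.
    vocab = set(tf_a) | set(tf_b)
    return math.sqrt(sum((tf_a.get(k, 0) - tf_b.get(k, 0)) ** 2 for k in vocab))

print(euclidean_distance(tfs_doc1, tfs_doc1))  # 0.0
print(euclidean_distance(tfs_doc1, tfs_doc2))  # 0.316227766016838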

This gives me a score of 0.316227766016838. I then try to verify that this is correct using sklearn, as shown below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [doc1, doc2]
corpus_vect = CountVectorizer().fit_transform(corpus).todense()

print(euclidean_distances(corpus_vect[0], corpus_vect[0]))  # output: 0
print(euclidean_distances(corpus_vect[0], corpus_vect[1]))  # output: 3.

I get an output of [[0.]] and [[3.]], which is roughly my "manual" result scaled up by the document length of 10. (The match isn't exact because CountVectorizer's default token_pattern drops single-character tokens such as "a"; counting "a" as well, the raw-count distance would be sqrt(10) ≈ 3.162.)
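If you want the vectorizer's vocabulary to match the manual whitespace split (keeping one-character tokens such as "a"), you can override token_pattern; a minimal sketch, assuming doc1 and doc2 as defined above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# The default pattern r"(?u)\b\w\w+\b" only matches tokens of length >= 2,
# so "a" is dropped; this variant keeps single-character tokens too.
vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
corpus_vect = vect.fit_transform([doc1, doc2])

print(euclidean_distances(corpus_vect[0], corpus_vect[1]))  # [[3.16227766]], i.e. sqrt(10)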

The problem: when I try to answer the initial question and "double" the strings, e.g.

doc1 = "the quicker brown dogs easily jumps over the lazy dogs the quicker brown dogs easily jumps over the lazy dogs" 
doc2 = "the quicker dogs pose a serious problem for lazy dogs the quicker dogs pose a serious problem for lazy dogs"

I get the same output for the manual technique (0.316227766016838) but [[0.]] and [[6.]] when using the "sklearn method" / Vectorizer. So with one method the Euclidean distance stays the same and with the other it doubles!

What is the correct solution and what causes the difference? Really stuck here. Thanks in advance.


Solution

  • As you double the string, the raw frequency of every term doubles as well. Hence, if before duplication documents A and B have term-frequency vectors (a1, a2, ..., ad) and (b1, b2, ..., bd) respectively, the Euclidean distance is sqrt((a1-b1)^2 + (a2-b2)^2 + ... + (ad-bd)^2). After duplication the vectors become (2*a1, 2*a2, ..., 2*ad) and (2*b1, 2*b2, ..., 2*bd), and the distance is:

    dist(A', B') = sqrt((2*a1 - 2*b1)^2 + (2*a2 - 2*b2)^2 + ... + (2*ad - 2*bd)^2)
                 = 2 * sqrt((a1-b1)^2 + (a2-b2)^2 + ... + (ad-bd)^2)
                 = 2 * dist(A, B)
    

    Beware: in the manual solution you divide each frequency by the length of the document, so doubling the document doubles both the counts and the length, leaving the normalized term frequencies, and hence the distance, unchanged. The sketch below confirms both behaviours.
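Here is a quick numeric check of both claims, reimplementing the raw and length-normalized counts from the question (the helper names raw_counts, normalized, and euclidean are illustrative, not from the original post):

import math

def raw_counts(doc):
    # Raw term frequencies: plain counts, no length normalization.
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def normalized(counts):
    # Divide each count by the document length, as calc_term_frequency does.
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def euclidean(tf_a, tf_b):
    # Distance over the union of both vocabularies; missing terms count as 0.
    vocab = set(tf_a) | set(tf_b)
    return math.sqrt(sum((tf_a.get(k, 0) - tf_b.get(k, 0)) ** 2 for k in vocab))

doc1 = "the quicker brown dogs easily jumps over the lazy dogs"
doc2 = "the quicker dogs pose a serious problem for lazy dogs"
d1, d2 = doc1 + " " + doc1, doc2 + " " + doc2  # the "doubled" documents

# Raw counts: the distance doubles.
print(euclidean(raw_counts(doc1), raw_counts(doc2)))  # sqrt(10) ~= 3.162
print(euclidean(raw_counts(d1), raw_counts(d2)))      # 2*sqrt(10) ~= 6.325

# Length-normalized frequencies: the distance is unchanged.
print(euclidean(normalized(raw_counts(doc1)), normalized(raw_counts(doc2))))  # ~0.316
print(euclidean(normalized(raw_counts(d1)), normalized(raw_counts(d2))))      # ~0.316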