I'm doing some work that involves document comparison. To do this, I'm analyzing each document and basically counting the number of times some key words appear in each of these documents. For instance:
Document 1:          Document 2:
Book   -> 3          Book   -> 9
Work   -> 0          Work   -> 2
Dollar -> 5          Dollar -> 1
City   -> 18         City   -> 6
So after the counting process, I store each of these sequences of counts in a vector. This sequence of counts represents the feature vector for each document.
Document 1: [ 3, 0, 5, 18]
Document 2: [ 9, 2, 1, 6]
The final step would be to normalize the data into the range [0, 1]. But here is where I realized this could be done following two different approaches: dividing each count by the sum of all counts in the document, or dividing each count by the maximum count in the document.

Following the first approach (dividing by the sum), the result of the normalization would be:
Document 1: [ 0.11538, 0.00000, 0.19231, 0.69231] (divided by 26)
Document 2: [ 0.50000, 0.11111, 0.05556, 0.33333] (divided by 18)
While following the second approach (dividing by the maximum), the result would be:
Document 1: [ 0.16667, 0.00000, 0.27778, 1.00000] (divided by 18)
Document 2: [ 1.00000, 0.22222, 0.11111, 0.66667] (divided by 9)
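For concreteness, the two approaches can be sketched in Python (the function names are mine):

```python
# Two candidate normalizations for a vector of word counts.
doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

def normalize_by_sum(v):
    # First approach: divide each count by the total of all counts.
    total = sum(v)
    return [x / total for x in v]

def normalize_by_max(v):
    # Second approach: divide each count by the largest count.
    m = max(v)
    return [x / m for x in v]

print(normalize_by_sum(doc1))  # [0.1153..., 0.0, 0.1923..., 0.6923...]
print(normalize_by_max(doc2))  # [1.0, 0.2222..., 0.1111..., 0.6666...]
```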
For this specific case:

Suppose you have two vectors A and B, and you use x as the normalization constant for A and y as the normalization constant for B. Since you are counting word occurrences, we can assume x > 0 and y > 0.
For the cosine similarity shown below, the normalization constants cancel out:

cos(A', B') = (A' · B') / (|A'| |B'|) = ((A · B) / (xy)) / ((|A| |B|) / (xy)) = cos(A, B)

It's easy to see that you get the constant 1/(xy) in the numerator and the identical constant 1/(xy) in the denominator, so you can cancel out 1/(xy): the cosine of the normalized vectors equals the cosine of the unnormalized ones.
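A quick numeric check of this cancellation, using the example vectors and the normalization constants from above (sums 26 and 18, maxima 18 and 9):

```python
import math

def cosine(a, b):
    # Cosine similarity of two equal-length vectors.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb)

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

raw    = cosine(doc1, doc2)
by_sum = cosine([v / 26 for v in doc1], [v / 18 for v in doc2])
by_max = cosine([v / 18 for v in doc1], [v / 9  for v in doc2])
# raw, by_sum and by_max agree up to floating-point rounding
```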
For Euclidean distance, this is not the case. Here is an example assuming A and B are 2-d vectors (an n-dimensional vector is just a simple extension of this). Let A = (x1, x2) and B = (y1, y2), and let A' = A/x and B' = B/y be the normalized vectors of A and B respectively:

dist(A, B)^2 = (x1 - y1)^2 + (x2 - y2)^2
             = (x1^2 + x2^2) + (y1^2 + y2^2) - 2(x1*y1 + x2*y2)

dist(A', B')^2 = (x1/x - y1/y)^2 + (x2/x - y2/y)^2
               = (x1^2 + x2^2)/x^2 + (y1^2 + y2^2)/y^2 - 2(x1*y1 + x2*y2)/(xy)
Comparing the unnormalized dist(A, B) with the normalized dist(A', B'), you can see that the normalization constants you choose (max or sum) determine the weights on x1^2 + x2^2, y1^2 + y2^2, and the interaction term. As a result, different normalization constants give you different distances.
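The same example vectors make this concrete; with Euclidean distance the two normalization choices no longer agree:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

# First approach: divide by the sums (26 and 18).
d_sum = euclidean([v / 26 for v in doc1], [v / 18 for v in doc2])
# Second approach: divide by the maxima (18 and 9).
d_max = euclidean([v / 18 for v in doc1], [v / 9  for v in doc2])
# d_sum ≈ 0.55 while d_max ≈ 0.94: the choice of constant changes the distance
```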
If this is for some information retrieval purpose or topic extraction, have you tried TF-IDF? That might be a better measure than purely counting the occurrences of terms.
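As a rough illustration, TF-IDF weights each count by how rare the word is across the collection. The sketch below uses raw term frequency with the smoothed idf log(N / (1 + df)) + 1; this is one common variant among several, and libraries such as scikit-learn use slightly different formulas:

```python
import math

def tf_idf(docs):
    """docs: list of {word: count} dicts.
    Returns one {word: tf-idf weight} dict per document.
    Uses raw term frequency times log(N / (1 + df)) + 1,
    a common smoothed-idf variant (not the only one)."""
    n = len(docs)
    df = {}  # document frequency: in how many docs each word appears
    for d in docs:
        for w, c in d.items():
            if c > 0:
                df[w] = df.get(w, 0) + 1
    return [
        {w: c * (math.log(n / (1 + df[w])) + 1) for w, c in d.items()}
        for d in docs
    ]

docs = [
    {"Book": 3, "Work": 0, "Dollar": 5, "City": 18},
    {"Book": 9, "Work": 2, "Dollar": 1, "City": 6},
]
weights = tf_idf(docs)
# "Work" appears only in document 2, so it keeps full idf weight there,
# while words shared by both documents are down-weighted.
```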