
Normalizing by max value or by total value?


I'm doing some work that involves document comparison. To do this, I'm analyzing each document and counting the number of times certain key words appear in it. For instance:

Document 1:                          Document 2:
    Book   -> 3                          Book   -> 9
    Work   -> 0                          Work   -> 2
    Dollar -> 5                          Dollar -> 1
    City   -> 18                         City   -> 6

After the counting process, I store each sequence of counts in a vector. This sequence of numbers is the feature vector for the corresponding document.

Document 1: [ 3,  0,  5, 18]
Document 2: [ 9,  2,  1,  6]
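
For reference, a minimal sketch of the counting step in Python (assuming simple whitespace tokenization; the KEYWORDS list and the feature_vector name are illustrative, not from any particular library):

    from collections import Counter

    # Illustrative keyword list; in practice this would be your own set of key words.
    KEYWORDS = ["book", "work", "dollar", "city"]

    def feature_vector(text):
        # Count whitespace-separated tokens, case-insensitively.
        counts = Counter(text.lower().split())
        return [counts[word] for word in KEYWORDS]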

The final step is to normalize the data to the range [0, 1]. But here is where I realized this could be done following two different approaches:

  1. Dividing each count by the total number of occurrences in the document
  2. Dividing each count by the maximum count in the document

Following the first approach, the result of the normalization would be:

Document 1: [ 0.11538,  0.00000,  0.19231, 0.69231]   (divided by 26)
Document 2: [ 0.50000,  0.11111,  0.05556, 0.33333]   (divided by 18)

While following the second approach, the result would be:

Document 1: [ 0.16667,  0.00000,  0.27778, 1.00000]   (divided by 18)
Document 2: [ 1.00000,  0.22222,  0.11111, 0.66667]   (divided by  9)
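
Both normalizations are easy to reproduce in Python (a small sketch; the function names are mine):

    def normalize_by_sum(vec):
        total = sum(vec)             # total number of occurrences
        return [v / total for v in vec]

    def normalize_by_max(vec):
        peak = max(vec)              # maximum count
        return [v / peak for v in vec]

    doc1 = [3, 0, 5, 18]
    print(normalize_by_sum(doc1))    # [0.11538..., 0.0, 0.19230..., 0.69230...]
    print(normalize_by_max(doc1))    # [0.16666..., 0.0, 0.27777..., 1.0]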

For this specific case:

  • Which of these two approaches gives a better representation and comparison of the feature vectors?
  • Are the results going to be the same?
  • Does either approach work better with a specific similarity measure (Euclidean, cosine)?

Solution

    Notation

    Suppose you have two vectors A and B. You use x as the normalization constant for A and y as the normalization constant for B. Since you are counting word occurrences, we can assume x > 0 and y > 0.

    Cosine distance

    For the cosine distance shown below, the normalization constants cancel out. It's easy to see: you get a constant 1/(xy) in the numerator and the identical constant 1/(xy) in the denominator, so 1/(xy) can be canceled.

    With A' = A/x and B' = B/y:

        cos(A', B') = (A' . B') / (|A'| |B'|)
                    = ((1/(xy)) (A . B)) / ((1/x)|A| * (1/y)|B|)
                    = (A . B) / (|A| |B|)
                    = cos(A, B)
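
    A quick numeric check with the vectors from the question, using a hand-rolled cosine similarity (any correct implementation will do):

        import math

        def cosine(a, b):
            dot = sum(ai * bi for ai, bi in zip(a, b))
            na = math.sqrt(sum(ai * ai for ai in a))
            nb = math.sqrt(sum(bi * bi for bi in b))
            return dot / (na * nb)

        A = [3, 0, 5, 18]
        B = [9, 2, 1, 6]

        # All three calls print the same value: the per-vector constants cancel.
        print(cosine(A, B))
        print(cosine([v / 26 for v in A], [v / 18 for v in B]))  # normalized by sum
        print(cosine([v / 18 for v in A], [v / 9 for v in B]))   # normalized by max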

    Euclidean distance

    For the Euclidean distance, this is not the case. Below is an example assuming A and B are 2-d vectors; the n-dimensional case is a straightforward extension. A' and B' are the normalized versions of A and B respectively.

    With A = (x1, x2), B = (y1, y2), A' = A/x, and B' = B/y:

        dist(A, B)   = sqrt((x1 - y1)^2 + (x2 - y2)^2)
                     = sqrt((x1^2 + x2^2) + (y1^2 + y2^2) - 2(x1*y1 + x2*y2))

        dist(A', B') = sqrt((x1/x - y1/y)^2 + (x2/x - y2/y)^2)
                     = sqrt((1/x^2)(x1^2 + x2^2) + (1/y^2)(y1^2 + y2^2) - (2/(xy))(x1*y1 + x2*y2))

    Comparing the unnormalized dist(A, B) with the normalized dist(A', B'), you can see that the normalization constants you choose (max or sum) determine the weights on x1^2 + x2^2, on y1^2 + y2^2, and on the interaction term. As a result, different normalization constants give you different distances, as the check below illustrates.
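
    The same setup as the cosine sketch above, this time with the Euclidean distance:

        import math

        def euclidean(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

        A = [3, 0, 5, 18]
        B = [9, 2, 1, 6]

        # The two normalizations disagree (roughly 0.55 vs. 0.94 here).
        print(euclidean([v / 26 for v in A], [v / 18 for v in B]))  # normalized by sum
        print(euclidean([v / 18 for v in A], [v / 9 for v in B]))   # normalized by max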

    Feature Vector

    If this is for some information retrieval or topic extraction purpose, have you tried TF-IDF? That might be a better measure than purely counting the occurrences of terms.
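
    If you want to try it, scikit-learn's TfidfVectorizer is one readily available implementation (a sketch, assuming your documents are plain strings and scikit-learn is installed):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        docs = ["text of document 1", "text of document 2"]  # placeholder texts

        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(docs)        # one TF-IDF row per document
        print(cosine_similarity(tfidf[0], tfidf[1]))  # pairwise cosine similarity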