Tags: machine-learning, nlp, classification, mahout

Why can vector normalization improve the accuracy of clustering and classification?


Mahout in Action states that normalization can slightly improve accuracy. Can anyone explain why? Thanks!


Solution

  • Normalization is not always required, but it rarely hurts.

    Some examples:

    K-means:

    K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation, leaving variances unequal lets the variables with larger variance dominate the distance computation, so clusters tend to be separated along those variables.

    Example in Matlab:

    % Two Gaussian clusters centered at (1,1) and (-1,-1)
    X = [randn(100,2)+ones(100,2);...
         randn(100,2)-ones(100,2)];
    
    % Uncomment the next line to denormalize the second feature
    % (large scale and offset) and compare the resulting clusters
    % X(:, 2) = X(:, 2) * 1000 + 500;
    
    opts = statset('Display','final');
    
    % Cluster into 2 groups using city-block (L1) distance
    [idx,ctrs] = kmeans(X,2,...
                        'Distance','cityblock',...
                        'Replicates',5,...
                        'Options',opts);
    
    plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
    hold on
    plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
    plot(ctrs(:,1),ctrs(:,2),'kx',...
         'MarkerSize',12,'LineWidth',2)
    plot(ctrs(:,1),ctrs(:,2),'ko',...
         'MarkerSize',12,'LineWidth',2)
    legend('Cluster 1','Cluster 2','Centroids',...
           'Location','NW')
    title('K-means with normalization')
    

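
    The same effect can be reproduced without random data. The following is a minimal Python sketch (not the Matlab code above) that assumes a toy dataset of eight points: feature 1 carries the real group structure (-1 or +1), while feature 2 is spread evenly and stored in large units, mimicking the commented-out denormalization line. Instead of running k-means iterations, it tries every split into two clusters and keeps the one with the smallest within-cluster sum of squares, so the result does not depend on initialization. The helper names (within_ss, best_two_clusters, standardize) are made up for this illustration.

```python
from itertools import product
from statistics import fmean, pstdev


def within_ss(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = [fmean(col) for col in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return total


def best_two_clusters(points):
    """Exhaustively find the 2-cluster split minimizing the k-means objective."""
    best_score, best_labels = None, None
    # Fix the first point in cluster 0 so mirrored labelings are not repeated.
    for tail in product((0, 1), repeat=len(points) - 1):
        labels = (0,) + tail
        if 1 not in labels:  # skip the split that leaves cluster 1 empty
            continue
        score = within_ss(points, labels)
        if best_score is None or score < best_score:
            best_score, best_labels = score, labels
    return best_labels


def standardize(points):
    """Z-score every feature (column): subtract the mean, divide by the std."""
    cols = list(zip(*points))
    means = [fmean(c) for c in cols]
    stds = [pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(p, means, stds)) for p in points]


# Assumed toy data: feature 1 is the true group, feature 2 is in large units.
raw = [(g, 1000.0 * u + 500.0) for g in (-1.0, 1.0) for u in (0, 1, 2, 3)]

labels_raw = best_two_clusters(raw)
labels_std = best_two_clusters(standardize(raw))

print(labels_raw)  # (0, 0, 1, 1, 0, 0, 1, 1): split follows feature 2
print(labels_std)  # (0, 0, 0, 0, 1, 1, 1, 1): split follows feature 1
```

    On the raw data the best split follows feature 2, which only looks important because of its units. After z-scoring, the best split follows feature 1, where the actual group structure is.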

    (FYI: see also "How can I detect if my dataset is clustered or unclustered (i.e. forming one single cluster)?")

    Distributed clustering:

    The comparative analysis shows that the distributed clustering results depend on the type of normalization procedure.

    Artificial neural network (inputs):

    If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
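
    The "undone by changing the corresponding weights and biases" argument can be checked on a single linear unit. This is a small Python sketch with made-up weights, inputs, and rescaling factors (none of them come from the source); it only covers the linear combination step, which is the part of an MLP that the argument relies on.

```python
import math


def linear_unit(x, w, b):
    """Weighted sum of the inputs plus a bias (the linear part of a neuron)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical weights, bias, and one input vector.
w, b = [0.5, -2.0], 0.25
x = [3.0, 1200.0]

# Rescale each input as x' = scale * x + shift (values chosen arbitrarily).
scale = [1.0, 0.001]
shift = [0.0, -1.0]
x_rescaled = [s * xi + t for xi, s, t in zip(x, scale, shift)]

# Compensate: w' = w / scale, and b' = b - sum(w' * shift).
w_new = [wi / s for wi, s in zip(w, scale)]
b_new = b - sum(wn * t for wn, t in zip(w_new, shift))

y_old = linear_unit(x, w, b)
y_new = linear_unit(x_rescaled, w_new, b_new)

print(math.isclose(y_old, y_new, rel_tol=1e-9))  # True
print(w, w_new)
```

    The outputs match, but the compensating weight on the second input grows from -2.0 to roughly -2000. The two parameterizations are equivalent in theory, yet a penalty such as weight decay would treat them very differently, which is one of the practical reasons given above.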

    Artificial neural network (inputs/outputs):

    Should you do any of these things to your data? The answer is, it depends.

    Standardizing either input or target variables tends to make the training process better behaved by improving the numerical condition (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html) of the optimization problem and ensuring that various default values involved in initialization and termination are appropriate. Standardizing targets can also affect the objective function.
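
    One way to see the "numerical condition" point is to look at a least-squares fit of a single linear unit with a bias. Under that assumption the curvature of the error surface is governed by the matrix X^T X, where each row of X is [1, x], and its condition number (largest eigenvalue divided by smallest) indicates how badly stretched the surface is. The Python sketch below uses four made-up input values with a large offset; the function name condition_number is hypothetical.

```python
import math
from statistics import fmean, pstdev


def condition_number(xs):
    """Condition number of X^T X for the design matrix with rows [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # X^T X = [[n, sx], [sx, sxx]] is symmetric 2x2, so its eigenvalues
    # follow from the trace and the determinant.
    trace = n + sxx
    det = n * sxx - sx * sx
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam_max = (trace + disc) / 2.0
    lam_min = det / lam_max  # product of the eigenvalues equals det
    return lam_max / lam_min


# Assumed raw inputs: small spread, large offset.
raw = [1000.0, 1001.0, 1002.0, 1003.0]
mean, std = fmean(raw), pstdev(raw)
standardized = [(x - mean) / std for x in raw]

cond_raw = condition_number(raw)
cond_std = condition_number(standardized)

print(f"raw inputs:          {cond_raw:.3g}")
print(f"standardized inputs: {cond_std:.3g}")
```

    With these values the raw problem has a condition number on the order of 10^11, while the standardized one is essentially 1, so gradient-based training sees a far better behaved surface.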

    Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous.
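
    To make the information loss concrete, here is a short Python sketch that assumes "standardizing cases" means rescaling each row to unit Euclidean length, and uses two hypothetical term-count vectors. After normalization, a short document and a document one hundred times longer with the same term proportions become indistinguishable.

```python
import math


def l2_normalize(row):
    """Scale one case (row) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


# Hypothetical term counts: same proportions, very different lengths.
short_doc = [1.0, 2.0, 0.0]
long_doc = [100.0, 200.0, 0.0]

a = l2_normalize(short_doc)
b = l2_normalize(long_doc)

# After row normalization only the direction of each vector survives.
same_direction = all(math.isclose(u, v) for u, v in zip(a, b))
print(same_direction)  # True
```

    If document length is irrelevant to the task, this is exactly the desired behavior. If length matters, the normalization has thrown away the one thing that separated the two cases.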


    Interestingly, changing the measurement units may even lead one to see a very different clustering structure: Kaufman, Leonard, and Peter J. Rousseeuw. "Finding groups in data: An introduction to cluster analysis." (2005).

    In some applications, changing the measurement units may even lead one to see a very different clustering structure. For example, the age (in years) and height (in centimeters) of four imaginary people are given in Table 3 and plotted in Figure 3. It appears that {A, B} and {C, D} are two well-separated clusters. On the other hand, when height is expressed in feet one obtains Table 4 and Figure 4, where the obvious clusters are now {A, C} and {B, D}. This partition is completely different from the first because each subject has received another companion. (Figure 4 would have been flattened even more if age had been measured in days.)

    To avoid this dependence on the choice of measurement units, one has the option of standardizing the data. This converts the original measurements to unitless variables.
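
    The unit dependence is easy to reproduce. The Python sketch below uses assumed ages and heights chosen to behave like the example described above (Tables 3 and 4 are not reproduced here, so these numbers are illustrative). It finds each person's nearest companion with height in centimeters and in feet, then shows that z-scored values are the same whichever unit was used.

```python
import math
from statistics import fmean, pstdev

# Assumed (age in years, height in cm) for four imaginary people.
people = {"A": (35.0, 190.0), "B": (40.0, 190.0),
          "C": (35.0, 160.0), "D": (40.0, 160.0)}


def nearest(data, name):
    """Name of the person closest to `name` in Euclidean distance."""
    others = (k for k in data if k != name)
    return min(others, key=lambda k: math.dist(data[name], data[k]))


def rescale_height(data, factor):
    """Express height in another unit by multiplying it with `factor`."""
    return {k: (age, h * factor) for k, (age, h) in data.items()}


def standardize(data):
    """Z-score each variable so that it becomes unitless."""
    cols = list(zip(*data.values()))
    stats = [(fmean(c), pstdev(c)) for c in cols]
    return {k: tuple((v - m) / s for v, (m, s) in zip(row, stats))
            for k, row in data.items()}


in_cm = people
in_feet = rescale_height(people, 1 / 30.48)  # 1 foot = 30.48 cm

print(nearest(in_cm, "A"))    # B: height differences dominate
print(nearest(in_feet, "A"))  # C: age differences dominate

z_cm, z_ft = standardize(in_cm), standardize(in_feet)
units_agree = all(math.isclose(u, v, abs_tol=1e-9)
                  for k in people for u, v in zip(z_cm[k], z_ft[k]))
print(units_agree)  # True
```

    With these particular numbers the standardized points form a square, so A is equally close to B and C. That makes the caveat in the next quotation visible: standardizing removes the dependence on units by giving both variables equal weight, which is itself a choice.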


    Kaufman and Rousseeuw continue with some interesting considerations (page 11):

    From a philosophical point of view, standardization does not really solve the problem. Indeed, the choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985). On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.