Tags: r, cluster-analysis, distance, euclidean-distance

dist function in R (stats) for clustering: should I put my ID variable in row.names?


I have a data frame with some numeric columns and an ID column, which is character. When I pass the whole data frame to the dist function, it calculates a distance matrix, but when I remove the ID column first and then call dist, I do not get the same result.
1) Why does this happen?
2) How should one handle the "ID" column when clustering in R? Should I drop the ID column, or should I put it in row.names?

PS I usually use tibbles and the tools in the tidyverse.


Solution

  • It is not obvious what is going to happen when we pass a data frame containing factor/character variables to dist.

    First, if the column is a character vector that holds numbers, such as c("1", "2"), it is coerced back to numeric data and takes part in the distance like any other column. In that case, unless differences between IDs have a meaning, you should not include this variable.

    Now consider what happens with a factor or character column whose values are not numbers. In the C source code we find some important lines:

    static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
    {
        double dev, dist;
        int count, j;

        count= 0;
        dist = 0;
        for(j = 0 ; j < nc ; j++) {
            if(both_non_NA(x[i1], x[i2])) {
                dev = (x[i1] - x[i2]);
                if(!ISNAN(dev)) {
                    dist += dev * dev;
                    count++;
                }
            }
            i1 += nr;
            i2 += nr;
        }
        if(count == 0) return NA_REAL;
        if(count != nc) dist /= ((double)count/nc);
        return sqrt(dist);
    }
    

    First (not in this function), factor/character values get coerced to NA when R tries to convert them to numbers. (The warning message says as much.) As a result, as the code of R_euclidean shows, the distance is rescaled:

    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
    

    where nc is the total number of columns and count is the number of columns in which both rows have non-missing values, which here is the number of numeric columns. We may verify this:

    k <- 20
    df <- data.frame(a = sample(letters, k, replace = TRUE), 
                     b = sample(letters, k, replace = TRUE), 
                     c = rnorm(k), d = rnorm(k))
    
    max(abs(as.matrix(dist(df)) * sqrt(2 / ncol(df)) - as.matrix(dist(df[, 3:4]))))
    # [1] 7.467696e-09
    

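    That is, we undid the rescaling in the distance matrix of df (by multiplying it by sqrt(2 / ncol(df))) and compared the result with the distance matrix computed without the two factor variables. There are some small numerical errors, probably because the numeric columns pass through a character representation when the mixed data frame is converted to a matrix, but the matrices are basically the same.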
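
    The same rescaling can be checked on a tiny deterministic example. This is only a sketch: it assumes that an all-NA numeric column (here named id, a made-up name) behaves like an ID column after coercion, and it sets count by hand because we know that two of the three columns are complete.

    ```r
    # Two points that are exactly 5 apart in (x, y), plus a column that is
    # entirely NA, standing in for an ID column after coercion (an assumption).
    m <- cbind(id = NA_real_, x = c(0, 3), y = c(0, 4))

    d_with_na <- as.numeric(dist(m))                 # only 2 of 3 columns usable
    d_numeric <- as.numeric(dist(m[, c("x", "y")]))  # plain Euclidean distance

    nc    <- ncol(m)  # total number of columns
    count <- 2        # columns in which both rows are non-NA

    # The squared sum is divided by count / nc, so the distance itself
    # grows by a factor of sqrt(nc / count).
    all.equal(d_with_na, d_numeric * sqrt(nc / count))
    ```

    Since no character conversion is involved here, the two sides should agree up to ordinary floating-point tolerance.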

    This explains why the results are different. If you are going to use a single distance matrix for, say, clustering, leaving the factor/character columns in seems to be fine: as long as there are no other missing values, every distance is multiplied by the same constant, so the scale shouldn't matter. However, in cases where scale matters, you should drop the factor/character columns first. (Whether you keep your ID variable as row names or as a separate vector doesn't matter and is up to you.)
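
    One possible way to keep the IDs without letting them affect the distances is to move them into the row names of a purely numeric matrix. The sketch below uses base R and made-up column names (id, x, y). With a tibble, which does not keep row names, a conversion such as tibble::column_to_rownames() would be needed first.

    ```r
    # Hypothetical data: a character ID column and two numeric columns.
    df <- data.frame(id = c("a", "b", "c"),
                     x  = c(0, 3, 6),
                     y  = c(0, 4, 8),
                     stringsAsFactors = FALSE)

    # Keep only the numeric columns and carry the IDs along as row names.
    m <- as.matrix(df[, c("x", "y")])
    rownames(m) <- df$id

    d <- dist(m)         # no coercion warning, no rescaling
    attr(d, "Labels")    # the IDs travel with the distance object

    hc <- hclust(d)      # the clustering result keeps the same labels
    hc$labels
    ```

    The labels then show up in plot(hc) and can be matched back to the original rows, which is the main practical reason to prefer row names over a separate vector.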