
Weighted Euclidean Distance in R


I'd like to create a distance matrix of weighted Euclidean distances from a data frame. The weights will be defined in a vector. Here's an example:

library("cluster")

a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df <- data.frame(a,b,c)

weighting <- c(1, 2, 3)

dm <- as.matrix(daisy(df, metric = "euclidean", weights = weighting))

I've searched everywhere and can't find a package or solution for this in R. The 'daisy' function in the 'cluster' package claims to support weighting, but the weights don't seem to be applied: it just returns regular Euclidean distances.

Any ideas, Stack Overflow?


Solution

  • We can use @WalterTross' technique of scaling by multiplying each column by the square root of its respective weight first:

    newdf <- sweep(df, 2, weighting, function(x,y) x * sqrt(y))
    as.matrix(daisy(newdf, metric="euclidean"))
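
    For reference, under the convention where each weight multiplies the squared difference, sqrt(sum_k w_k * (x_ik - x_jk)^2), moving sqrt(w_k) inside the square gives an ordinary Euclidean distance on columns multiplied by sqrt(w_k). The following is a minimal sketch that checks this on the question's data. It uses base R's dist() in place of daisy (my substitution, to avoid the package dependency), and the names scaled, d_scaled and d_direct_12 are illustrative only:

    ```r
    # Data and weights from the question
    df <- data.frame(a = c(1, 2, 3, 4, 5),
                     b = c(5, 4, 3, 2, 1),
                     c = c(5, 4, 1, 2, 3))
    weighting <- c(1, 2, 3)

    # Scale each column by sqrt(weight), then take ordinary Euclidean distances
    scaled <- sweep(as.matrix(df), 2, sqrt(weighting), `*`)
    d_scaled <- as.matrix(dist(scaled))

    # Direct evaluation of sqrt(sum_k w_k * (x_1k - x_2k)^2) for rows 1 and 2
    x <- as.matrix(df)
    d_direct_12 <- sqrt(sum(weighting * (x[1, ] - x[2, ])^2))

    # Both routes should agree for this pair of rows
    all.equal(unname(d_scaled[1, 2]), d_direct_12)
    # TRUE
    ```

    For rows 1 and 2 the differences are (-1, 1, 1), so both routes give sqrt(1 + 2 + 3) = sqrt(6).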
    

    But in case you would like more control over, and a better understanding of, how Euclidean distance is computed, we can write a custom function. Note that I have chosen a different weighting method: the weights multiply the row differences directly, before squaring.

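    # All (i, j) pairs of row indices
    xpand <- function(d) do.call("expand.grid", rep(list(seq_len(nrow(d))), 2))
    # Euclidean norm of a vector
    euc_norm <- function(x) sqrt(sum(x^2))
    # Pairwise distances between rows; weights multiply the differences
    euc_dist <- function(mat, weights = 1) {
      iter <- xpand(mat)
      vec <- mapply(function(i, j) euc_norm(weights * (mat[i, ] - mat[j, ])),
                    iter[, 1], iter[, 2])
      matrix(vec, nrow(mat), nrow(mat))
    }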
    

    We can test the result by checking against the daisy function:

    #test1
    as.matrix(daisy(df, metric="euclidean"))
    #          1        2        3        4        5
    # 1 0.000000 1.732051 4.898979 5.196152 6.000000
    # 2 1.732051 0.000000 3.316625 3.464102 4.358899
    # 3 4.898979 3.316625 0.000000 1.732051 3.464102
    # 4 5.196152 3.464102 1.732051 0.000000 1.732051
    # 5 6.000000 4.358899 3.464102 1.732051 0.000000
    
    euc_dist(df)
    #          [,1]     [,2]     [,3]     [,4]     [,5]
    # [1,] 0.000000 1.732051 4.898979 5.196152 6.000000
    # [2,] 1.732051 0.000000 3.316625 3.464102 4.358899
    # [3,] 4.898979 3.316625 0.000000 1.732051 3.464102
    # [4,] 5.196152 3.464102 1.732051 0.000000 1.732051
    # [5,] 6.000000 4.358899 3.464102 1.732051 0.000000
    

    I doubt Walter's method for two reasons. Firstly, I've never seen weights applied by their square root; it's usually 1/w. Secondly, when I apply your weights to my function, I get a different result:

    euc_dist(df, weights=weighting)
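
    The two results need not conflict, since they appear to correspond to different weighting conventions. euc_dist multiplies each difference by its weight before squaring, so it computes sqrt(sum_k (w_k * d_k)^2), whereas the square-root scaling computes sqrt(sum_k w_k * d_k^2). Here is a sketch under that reading, which passes sqrt(weighting) to euc_dist and compares the result against the scaled version. Base R's dist() again stands in for daisy, and m, by_weight, by_sqrt and scaled are illustrative names:

    ```r
    # Data and weights from the question
    df <- data.frame(a = c(1, 2, 3, 4, 5),
                     b = c(5, 4, 3, 2, 1),
                     c = c(5, 4, 1, 2, 3))
    weighting <- c(1, 2, 3)

    # Custom distance function from the answer, repeated so the block runs alone
    xpand <- function(d) do.call("expand.grid", rep(list(seq_len(nrow(d))), 2))
    euc_norm <- function(x) sqrt(sum(x^2))
    euc_dist <- function(mat, weights = 1) {
      iter <- xpand(mat)
      vec <- mapply(function(i, j) euc_norm(weights * (mat[i, ] - mat[j, ])),
                    iter[, 1], iter[, 2])
      matrix(vec, nrow(mat), nrow(mat))
    }

    m <- as.matrix(df)

    # Weights applied to the differences: sqrt(sum((w * d)^2))
    by_weight <- euc_dist(m, weights = weighting)

    # Square roots of the weights applied to the differences: sqrt(sum(w * d^2))
    by_sqrt <- euc_dist(m, weights = sqrt(weighting))

    # Column scaling by sqrt(weight) followed by ordinary Euclidean distance
    scaled <- as.matrix(dist(sweep(m, 2, sqrt(weighting), `*`)))

    # by_sqrt should match the scaled version (dimnames are ignored here)
    all.equal(by_sqrt, scaled, check.attributes = FALSE)
    ```

    For rows 1 and 2, by_weight gives sqrt(1 + 4 + 9) = sqrt(14) while by_sqrt gives sqrt(6), so which one is "right" depends on whether the weights are meant to act on the differences or on the squared differences.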