I'd like to create a distance-matrix with weighted euclidean distances from a data frame. The weights will be defined in a vector. Here's an example:
library("cluster")
a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df <- data.frame(a,b,c)
weighting <- c(1, 2, 3)
dm <- as.matrix(daisy(df, metric = "euclidean", weights = weighting))
I've searched everywhere and can't find a package or solution to this in R. The 'daisy' function within the 'cluster' package claims to support weighting, but the weights don't seem to be applied and it just spits out regular euclid. distances.
Any ideas Stack Overflow?
We can use @WalterTross' technique of scaling by multiplying each column by the square root of its respective weight first:
newdf <- sweep(df, 2, weighting, function(x,y) x * sqrt(y))
as.matrix(daisy(newdf, metric="euclidean"))
But just in case you would like to have more control and understanding of what euclidean distance is, we can write a custom function. As a note, I have chosen a different weighting method. :
xpand <- function(d) do.call("expand.grid", rep(list(1:nrow(d)), 2))
euc_norm <- function(x) sqrt(sum(x^2))
euc_dist <- function(mat, weights=1) {
iter <- xpand(mat)
vec <- mapply(function(i,j) euc_norm(weights*(mat[i,] - mat[j,])),
iter[,1], iter[,2])
matrix(vec,nrow(mat), nrow(mat))
}
We can test the result by checking against the daisy
function:
#test1
as.matrix(daisy(df, metric="euclidean"))
# 1 2 3 4 5
# 1 0.000000 1.732051 4.898979 5.196152 6.000000
# 2 1.732051 0.000000 3.316625 3.464102 4.358899
# 3 4.898979 3.316625 0.000000 1.732051 3.464102
# 4 5.196152 3.464102 1.732051 0.000000 1.732051
# 5 6.000000 4.358899 3.464102 1.732051 0.000000
euc_dist(df)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.000000 1.732051 4.898979 5.196152 6.000000
# [2,] 1.732051 0.000000 3.316625 3.464102 4.358899
# [3,] 4.898979 3.316625 0.000000 1.732051 3.464102
# [4,] 5.196152 3.464102 1.732051 0.000000 1.732051
# [5,] 6.000000 4.358899 3.464102 1.732051 0.000000
The reason I doubt Walter's method is because firstly, I've never seen weights applied by their square root, it's usually 1/w
. Secondly, when I apply your weights to my function, I get a different result.
euc_dist(df, weights=weighting)