I am trying to make the computation of dissimilarity between records of a massive dataset (600,000 records) feasible.
The first task is to compute the dissimilarity, using Euclidean distance, between one single record and the whole data.frame excluding that record.
Considering the following sample:
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1,]
one_row
This single row should be compared to each row of mydf[-1,].
Then I could iterate this process for every row in mydf and thereby find, for each row, its most similar row. This would allow me to perform agglomerative clustering, as well as to compute statistics such as the Silhouette criterion that are based on the distance matrix.
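A minimal sketch of that iteration (my own illustration, not part of the original question): for each row, compute its distances to all rows in one vectorized step, mask out the row itself, and take the minimum.

```r
# Sketch: find each row's nearest neighbour one row at a time,
# so the full n-by-n distance matrix is never stored.
set.seed(1)
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
m <- as.matrix(mydf)

nearest <- vapply(seq_len(nrow(m)), function(i) {
  # subtract row i from every row, then take row-wise Euclidean norms
  d <- sqrt(rowSums(sweep(m, 2, m[i, ])^2))
  d[i] <- Inf                 # exclude the row itself
  unname(which.min(d))
}, integer(1))

nearest                       # index of the most similar row, per row
```

This keeps memory at O(n) per step instead of the O(n^2) a full dissimilarity matrix would need.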
One possible approach is to replicate one_row to the same size as mydf and vectorize the similarity computation by performing it pair-wise.
replicated <- mydf[rep(1, nrow(mydf)), 1:ncol(mydf)]
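Put together, the replicate-and-vectorize idea looks like this (a sketch completing the snippet above; the variable names beyond those in the question are mine):

```r
# Replicate one_row to the shape of mydf, then compute all
# pair-wise differences in a single vectorized expression.
set.seed(1)
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1, ]

replicated <- one_row[rep(1, nrow(mydf)), ]     # n copies of row 1
dists <- sqrt(rowSums((mydf - replicated)^2))   # pair-wise, no explicit loop

dists[1]  # distance of the row to itself, exactly 0
```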
Both the answers of Jesse Tweedle and won782 correctly address my question.
The positive aspect of Jesse Tweedle's is the possibility of customizing the distance function, which allows mixed data types to be used. The negative side is that it is not a single expression but a pipe of functions.
The positive aspect of won782's is that it is performed in a single expression. The negative aspect is that it only works on matrices and, therefore, only on numeric variables.
I chose won782's answer because his solution can easily be extended into a fundamental component for computing the Silhouette criterion without storing the dissimilarity matrix.
If I understood your question correctly, you want to perform a row-wise operation with a given vector and compute the Euclidean distance to every row.
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1,]
result <- apply(mydf, 1, function(x) {
  sqrt(sum((x - one_row)^2))
})
result
[1] 0.000000 3.333031 3.737814 1.875482 4.216042
The result is a vector of Euclidean distances. You can then use the which.min function to find the index of the minimum value.
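For example (my addition, not from the original answer): since a row's distance to itself is 0, mask it out before calling which.min so you get the nearest *other* row.

```r
# Find the row most similar to row 1, excluding row 1 itself.
set.seed(1)
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1, ]

result <- apply(mydf, 1, function(x) sqrt(sum((x - one_row)^2)))
result[1] <- Inf           # exclude the self-distance of 0
which.min(result)          # index of the most similar row to row 1
```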
Using a matrix operation:
sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
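The double transpose works because R recycles the vector as.numeric(one_row) down the columns of t(mydf), i.e. along each original row. A quick consistency check (my own, not part of the answer) against the apply version:

```r
# Verify the transpose trick matches the row-wise apply computation.
set.seed(1)
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1, ]

d_matrix <- sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
d_apply  <- apply(mydf, 1, function(x) sqrt(sum((x - one_row)^2)))

all.equal(unname(d_matrix), unname(d_apply))  # TRUE
```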
Benchmarking the two methods on a larger dataset:
> mydf <- data.frame(var1 = rnorm(10000), var2 = rnorm(10000), var3 = rnorm(10000))
> one_row <- mydf[1,]
> # Matrix operation method
> system.time({
+ sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
+ })
user system elapsed
0.000 0.000 0.001
> # Apply Method
> system.time({
+ apply(mydf, 1, function(x) {
+ sqrt(sum((x - one_row)^2))
+ })
+ })
user system elapsed
5.186 0.014 5.204
So clearly, the matrix operation is the faster method.