
Minimum dissimilarity between one record and a whole data.frame


I am trying to make it feasible to compute dissimilarities between the records of a massive dataset (600,000 records).

The first task is to compute the dissimilarity, using Euclidean distance, between a single record and every other record in the data.frame.

Considering the following sample:

mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1,]

The question has two parts:

  1. use a vectorized operation to return a vector of length 4 holding the dissimilarity between one_row and each row of mydf[-1,]
  2. from the vector in step 1, extract the index of the row most similar to one_row

Then I could repeat this process for every row in mydf and so find, for each row, its most similar row. This would allow me to perform agglomerative clustering and to compute statistical criteria such as the Silhouette, which are based on a distance matrix.
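
The iteration described above could be sketched roughly as follows. This is only an illustration, assuming all columns are numeric; the `nearest` vector and the `set.seed` call are additions for the example, not part of the original question. Each pass computes the distances from one row to all rows, so the full n x n distance matrix is never held in memory.

```r
# Sketch: nearest neighbour of every row, one row at a time.
set.seed(42)                                # only to make the example reproducible
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
m <- as.matrix(mydf)                        # assumes numeric columns only

nearest <- integer(nrow(m))                 # nearest[i] = index of the row closest to row i
for (i in seq_len(nrow(m))) {
  # Euclidean distance from row i to every row (including itself)
  d <- sqrt(rowSums(t(t(m) - m[i, ])^2))
  d[i] <- Inf                               # exclude the row itself (distance 0)
  nearest[i] <- which.min(d)
}
nearest
```

For 600,000 records this is still a loop of 600,000 vectorized passes, so it trades memory for running time.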

Update

One possible approach is to replicate one_row so that it has as many rows as mydf, and then vectorize the dissimilarity computation by performing it element-wise.

replicated <- one_row[rep(1, 5), 1:ncol(one_row)]
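
A runnable sketch of this replication idea, assuming all columns are numeric. The variable `d` and the conversion to matrices are choices made for this example, not something stated in the question.

```r
set.seed(1)                                 # only to make the example reproducible
mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
one_row <- mydf[1, ]

# Repeat the single row once per row of mydf so both objects have the same shape
replicated <- one_row[rep(1, nrow(mydf)), ]

# Element-wise difference on matrices, then the Euclidean norm of each row
d <- sqrt(rowSums((as.matrix(mydf) - as.matrix(replicated))^2))
d                                           # first entry is 0: one_row compared with itself
```

The replicated copy is as large as mydf itself, which may matter at 600,000 rows.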

Correct Answer

The answers from Jesse Tweedle and won782 both answer my question correctly.

The advantage of Jesse Tweedle's answer is that the distance function can be customized, which allows mixed data types. The drawback is that it is a pipeline of functions rather than a single expression.

The advantage of won782's answer is that it is a single expression. The drawback is that it only works on matrices and therefore only on numeric variables.

I chose won782's answer because the solution can easily be extended to serve as the basic building block for computing the Silhouette criterion without storing the dissimilarity matrix.
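
To illustrate what such an extension might look like, here is a minimal sketch. It is not won782's code: the function `silhouette_row`, the label vector `cluster`, and the toy data are all hypothetical and only assume that cluster labels are already available. Distances are computed one row at a time, so no n x n matrix is stored.

```r
# Toy data: two well-separated groups of 10 rows each, with known labels
set.seed(123)
m <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 5), ncol = 3))
cluster <- rep(1:2, each = 10)              # hypothetical cluster labels

silhouette_row <- function(i, m, cluster) {
  d <- sqrt(rowSums(t(t(m) - m[i, ])^2))    # distances from row i to all rows
  own <- cluster == cluster[i]
  if (sum(own) == 1) return(0)              # singleton cluster: width taken as 0
  a <- sum(d[own]) / (sum(own) - 1)         # mean distance within own cluster (d[i] is 0)
  b <- min(tapply(d[!own], cluster[!own], mean))  # smallest mean distance to another cluster
  (b - a) / max(a, b)
}

sil <- vapply(seq_len(nrow(m)), silhouette_row, numeric(1),
              m = m, cluster = cluster)
mean(sil)                                   # average silhouette width
```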


Solution

  • If I understood your question correctly, you want to perform a row-wise operation: compute the Euclidean distance between a given vector and every row.

    mydf <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm(5))
    one_row <- mydf[1,]
    
    result = apply(mydf, 1, function(x) {
      sqrt(sum((x - one_row)^2))
    })
    result
    [1] 0.000000 3.333031 3.737814 1.875482 4.216042
    

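    The result is a vector of Euclidean distances. You can then use which.min to find the index of the minimum value. Note that the first entry is the distance from one_row to itself (0), so drop it first, e.g. which.min(result[-1]) + 1.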

    Using a matrix operation:

    sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
    

    Benchmarking the two methods on a larger dataset:

    > mydf <- data.frame(var1 = rnorm(10000), var2 = rnorm(10000), var3 = rnorm(10000))
    > one_row <- mydf[1,]
    > # Matrix operation method
    > system.time({ 
    +   sqrt(rowSums((t(t(as.matrix(mydf)) - as.numeric(one_row)))^2))
    +   })
       user  system elapsed 
      0.000   0.000   0.001 
    > # Apply Method
    > system.time({ 
    +   apply(mydf, 1, function(x) {
    +     sqrt(sum((x - one_row)^2))
    +   })
    + })
       user  system elapsed 
      5.186   0.014   5.204 
    

    So the matrix operation is clearly the faster method.
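
The double transpose in the matrix expression above is needed because R recycles a vector down the columns of a matrix, not across its rows. A small sketch with made-up values (the names `m`, `v` and `d` are only for this example) shows the difference, and that `sweep()` is an equivalent and arguably more readable alternative:

```r
m <- matrix(1:6, nrow = 3)      # 3 rows, 2 columns; the rows are (1,4), (2,5), (3,6)
v <- c(1, 4)                    # the first row of m

m - v                           # recycles v down the columns: NOT a row-wise subtraction
t(t(m) - v)                     # subtracts v from every row
sweep(m, 2, v)                  # same result as the line above

d <- sqrt(rowSums(t(t(m) - v)^2))   # Euclidean distance of each row from v
d
```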