r, similarity, cosine-similarity

How to find the best resemblance between one row and the rest of a data frame in R?


How can I find the best resemblance between one particular row and the rest of the rows in a data frame?

To illustrate what I mean, consider this data frame:

df <- structure(list(person = 1:5, var1 = c(1L, 5L, 2L, 2L, 5L), var2 = c(4L, 
4L, 3L, 2L, 2L), var3 = c(5L, 4L, 4L, 3L, 1L)), .Names = c("person", 
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA, 
-5L))

How can I find the best resemblance between person 1 (row 1) and the rest of the rows (persons) in the data frame? The output should keep person 1 in row 1, followed by the remaining rows ordered from most to least similar. The similarity measure I want to use is cosine or Pearson. I tried to solve this with functions from the arules package, but they did not fit my needs.

Any ideas?


Solution

  • One option is to define the cosine similarity function manually and apply it to each row of your data frame, i.e.

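    # Cosine similarity between two numeric vectors
    f1 <- function(x, y){
      crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))
    }
    
    # Similarity of person 1 to each remaining row (person column dropped)
    sims <- sapply(2:nrow(df), function(i) 
                     f1(unlist(df[1, -1]), unlist(df[i, -1])))
    
    # Keep row 1 first, then the other rows by decreasing similarity
    df[c(1, order(sims, decreasing = TRUE) + 1), ]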
    

    which gives,

       person var1 var2 var3
    1      1    1    4    5
    3      3    2    3    4
    4      4    2    2    3
    2      2    5    4    4
    5      5    5    2    1
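
Since the question also mentions Pearson, the same ordering idea can be sketched with base R's `cor()`, whose default method is Pearson. This is only a minimal sketch: `rank_by_similarity` and `cosine` are hypothetical helper names introduced here, and it assumes the identifier sits in the first column and all other columns are numeric.

```r
# Same data as in the question
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# Hypothetical helper: cosine similarity returning a plain numeric scalar
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical helper: keep the reference row first and order the
# remaining rows by decreasing similarity to it.
# Assumes column `id_col` is an identifier and the rest are numeric.
rank_by_similarity <- function(data, ref = 1, id_col = 1, sim = cor) {
  m <- as.matrix(data[, -id_col])
  others <- setdiff(seq_len(nrow(m)), ref)
  scores <- vapply(others, function(i) sim(m[ref, ], m[i, ]), numeric(1))
  data[c(ref, others[order(scores, decreasing = TRUE)]), ]
}

pearson_ranked <- rank_by_similarity(df)                # Pearson via cor()
cosine_ranked  <- rank_by_similarity(df, sim = cosine)  # cosine, as above
```

On this particular data frame both measures should give the same order of persons (1, 3, 4, 2, 5), although in general Pearson and cosine can rank rows differently because Pearson centres each row first. With only three variables per person, the correlations are rough estimates at best.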