r, dplyr, row, tidyverse

How to compare every row of a dataframe to all of its rows in R?


For every row of a dataframe, I want to count how many of its values are equal to the values in the same columns of each row of that dataframe (including itself):

library(tidyverse)

df <- tibble(
  a = c(1, 1, 5, 1),
  b = c(2, 3, 2, 8),
  c = c(2, 6, 2, 2)
)

Desired output:

# A tibble: 4 x 4
      a     b     c desired_column
  <dbl> <dbl> <dbl> <list>        
1     1     2     2 <dbl [4]>     
2     1     3     6 <dbl [4]>     
3     5     2     2 <dbl [4]>     
4     1     8     2 <dbl [4]> 


In the column "desired_column", the first row is 3, 1, 2, 2:

3: the first row compared with itself has all three values equal.

1: the first and second rows have one equal value in the same column (a = 1).

2: the first and third rows have two equal values in the same columns (b = 2 and c = 2).

2: the first and fourth rows have two equal values in the same columns (a = 1 and c = 2).

The second, third and fourth rows of "desired_column" are the results of the same process: the ith number in the result is the number of values that the current row and the ith row have in common, compared column by column.
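As a quick check of the numbers above, the first row's counts can be reproduced by hand with plain vector comparisons. This is only an illustrative sketch: the names `row1` to `row4` and `first_row_counts` are made up here and are not part of the original code.

```r
# Rows of df written out as plain vectors (hypothetical names, for illustration)
row1 <- c(a = 1, b = 2, c = 2)
row2 <- c(a = 1, b = 3, c = 6)
row3 <- c(a = 5, b = 2, c = 2)
row4 <- c(a = 1, b = 8, c = 2)

# `==` compares position by position; sum() counts the TRUE values
first_row_counts <- c(
  sum(row1 == row1),
  sum(row1 == row2),
  sum(row1 == row3),
  sum(row1 == row4)
)
first_row_counts  # 3 1 2 2
```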


Solution

  • You can do this as follows. For each row of the dataframe, replicate it to build a new dataframe of the same size in which every row equals that row, then compare that dataframe element-wise with the original (TRUE where the values are the same). Taking rowSums of each comparison gives the vectors you want.

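    # Build the desired output as a list: one vector of match counts per row
    lst <- 
      lapply(seq_len(nrow(df)), function(nr) {
        # Stack nrow(df) copies of row nr, compare element-wise with df,
        # then count the matches in each row
        rowSums(replicate(nrow(df), df[nr, ], simplify = FALSE) %>% 
                  do.call("rbind", .) == df)
      })
    
    # Add the list to the dataframe as a new column
    df %>% tibble(desired_column = I(lst))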
    

    In the tibble() call on the last line, I() is used so that the list is inserted as a column as-is.
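To make the steps of this approach easier to follow, here is a sketch that unpacks them for a single row and then repeats them for all rows. It makes two substitutions that are assumptions of this sketch, not the answer's code: it uses a base `data.frame` so it runs without extra packages, and it builds the repeated rows with `df[rep(nr, nrow(df)), ]` instead of `replicate()` plus `rbind`. The names `repeated`, `same` and `counts_row1` are illustrative only.

```r
# Same data as in the question, as a base data.frame so no packages are needed
df <- data.frame(
  a = c(1, 1, 5, 1),
  b = c(2, 3, 2, 8),
  c = c(2, 6, 2, 2)
)

nr <- 1  # the row being compared against all rows

# Step 1: a data frame shaped like df in which every row is row nr
repeated <- df[rep(nr, nrow(df)), ]

# Step 2: element-wise comparison gives a logical matrix (TRUE where values match)
same <- repeated == df

# Step 3: count the matches in each row
counts_row1 <- unname(rowSums(same))
counts_row1  # 3 1 2 2

# The same three steps for every row, collected in a list
lst <- lapply(seq_len(nrow(df)), function(nr) {
  unname(rowSums(df[rep(nr, nrow(df)), ] == df))
})
```

Because "row i matches row j in k columns" is symmetric, the vectors in `lst` form a symmetric table of counts with 3 on the diagonal, which is a handy sanity check on the result.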