Search code examples
rdatatabletidyversesimilarity

Calculate row similarity percentage pair wise and add it as a new column


I have a date frame like this sample, I would like to find similar rows (not duplicate) and calculate similarity per wise. I find this solution but i would like to keep all my columns and add similarity percentage as a new variable. My aim is to find records with highest similarity percentage. How could I do it ?

sample data set

df <- tibble::tribble(
     ~date, ~user_id, ~Station_id, ~location_id, ~ind_id, ~start_hour, ~start_minute, ~start_second, ~end_hour, ~end_minute, ~end_second, ~duration_min,
  20191015, 19900234,         242,            2,    "ac",           7,            25,             0,         7,          30,          59,             6,
  20191015, 19900234,         242,            2,    "ac",           7,            31,             0,         7,          32,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           7,            33,             0,         7,          38,          59,             6,
  20191015, 19900234,         242,            2,    "ac",           7,            39,             0,         7,          40,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           7,            41,             0,         7,          43,          59,             3,
  20191015, 19900234,         242,            2,    "ac",           7,            44,             0,         7,          45,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           7,            47,             0,         7,          59,          59,            13,
  20191015, 19900234,         242,            2,    "ad",           7,            47,             0,         7,          59,          59,            13,
  20191015, 19900234,         242,            2,    "ac",           8,             5,             0,         8,           6,          59,             2,
  20191015, 19900234,         242,            2,    "ad",           8,             5,             0,         8,           6,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           8,             7,             0,         8,           8,          59,             2,
  20191015, 19900234,         242,            2,    "ad",           8,             7,             0,         8,           8,          59,             2,
  20191015, 19900234,         242,            2,    "ac",          16,            26,             0,        16,          55,          59,            30,
  20191015, 19900234,         242,            2,    "ad",          16,            26,             0,        16,          55,          59,            30,
  20191015, 19900234,         242,            2,    "ad",          17,             5,             0,        17,           6,          59,             2,
  20191015, 19900234,         242,            2,    "ac",          17,             5,             0,        17,          23,          59,            19,
  20191015, 19900234,         242,            2,    "ad",          17,             7,             0,        17,          15,          59,             9,
  20191015, 19900234,         242,            2,    "ad",          17,            16,             0,        17,          22,          59,             7,
  20191015, 19900234,         264,            2,    "ac",          17,            24,             0,        17,          35,          59,            12,
  20191015, 19900234,         264,            2,    "ad",          17,            25,             0,        17,          35,          59,            11,
  20191016, 19900234,         242,            1,    "ac",           7,            12,             0,         7,          14,          59,             3,
  20191016, 19900234,         242,            1,    "ad",           7,            13,             0,         7,          13,          59,             1,
  20191016, 19900234,         242,            1,    "ac",          17,            45,             0,        17,          49,          59,             5,
  20191016, 19900234,         242,            1,    "ad",          17,            46,             0,        17,          48,          59,             3,
  20191016, 19900234,         242,            2,    "ad",           7,            14,             0,         8,           0,          59,            47,
  20191016, 19900234,         242,            2,    "ac",           7,            15,             0,         8,           0,          59,            47
  )

Function for comparing rows

row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}

Function output

# 1) Create all possible row combinations
# 2) Rename 
# 3) Run through each row
# 4) Calculate similarity

expand.grid(1:nrow(df), 1:nrow(df)) %>% 
  rename(row_1 = Var1, row_2 = Var2) %>% 
  rowwise() %>% 
  mutate(similarity = row_cf(row_1, row_2, df))


# A tibble: 676 x 3
   row_1 row_2 similarity
   <int> <int>      <dbl>
 1     1     1      1    
 2     2     1      0.75 
 3     3     1      0.833
 4     4     1      0.75 
 5     5     1      0.75 
 6     6     1      0.75 
 7     7     1      0.75 
 8     8     1      0.667
 9     9     1      0.583
10    10     1      0.5 

Edit: I would like to find similar rows in the data like here enter image description here


Solution

  • Using your "function output", call it sim. Eliminate the self-comparisons and then keep the max similarity row grouped by row_1:

    sim = sim %>% 
      filter(row_1 != row_2) %>%
      group_by(row_1) %>% 
      slice(which.max(similarity))
    

    Then you can add these to your original data:

    df %>% mutate(row_1 = 1:n()) %>%
      left_join(sim)
    

    The row_2 column gives the row number of the most similar row, and similarity gives its similarity score. (You may want to improve these column names.)