Search code examples
rmatching

Match rows with other row with smallest difference on a column


I want to perform matching between two groups in a data frame, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. Let's use the toy-dataset below:

set.seed(123)

df <- data.frame(id = c(1:10),
                 group = rbinom(10,1, 0.3),
                 value = round(runif(10),2))

threshold <- round(sd(df$value),2)

Which looks like this

> df
   id group value
1   1     0  0.96
2   2     1  0.45
3   3     0  0.68
4   4     1  0.57
5   5     1  0.10
6   6     0  0.90
7   7     0  0.25
8   8     1  0.04
9   9     0  0.33
10 10     0  0.95

> threshold 
[1] 0.35

In this case, I want to match rows with group==1 with rows with group==2 where the difference between value is smaller than threshold(0.35). This should lead to a data frame looking like this (apologizes for potential error, did it manually).

   id matched_id
1   2          3
2   2          7
3   2          9
4   4          3
5   4          6
6   4          7
7   4          9
8   5          7
9   5          9
10  8          7
11  8          9    

Thank you!


Solution


  • UPDATED ANSWER: Was going slow on a larger dataset, so I tried to make the code a bit more efficient.

    Came up with a solution that seems to do what I want. Not sure how efficient this code is on larger data but seems to work.

    library(tidyverse)
    library(data.table)
    
    # All values
    dist_mat <- df$value
    
    # Adding identifier
    names(dist_mat) <- df$id
    
    # Dropping combinations that are not of interest
    dist_mat_col <-dist_mat[df$group == 0]
    dist_mat_row <- dist_mat[df$group == 1]
    
    # Difference between each value
    dist_mat <- abs(outer(dist_mat_row, dist_mat_col, "-"))
    
    # Identifying matches that fulfills the criteria
    dist_mat <- dist_mat <= threshold 
    
    # From matrix to a long dataframe
    dist_mat <- melt(dist_mat)
    
    # Tidying up the dataframe and dropping unneccecary columns and rows. 
    dist_mat <- dist_mat %>%
      rename(id = Var1,
             matched_id = Var2,
             cond = value) %>%
      filter(cond == TRUE) %>%
      left_join(df, by = "id") %>%
      select(id, matched_id)
    

    This leads to the following dataframe:

    > arrange(dist_mat, id)
       id matched_id
    1   2          3
    2   2          7
    3   2          9
    4   4          3
    5   4          6
    6   4          7
    7   4          9
    8   5          7
    9   5          9
    10  8          7
    11  8          9