I want to perform matching between two groups in a data frame, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. Let's use the toy-dataset below:
set.seed(123)
df <- data.frame(id = c(1:10),
group = rbinom(10,1, 0.3),
value = round(runif(10),2))
threshold <- round(sd(df$value),2)
Which looks like this
> df
id group value
1 1 0 0.96
2 2 1 0.45
3 3 0 0.68
4 4 1 0.57
5 5 1 0.10
6 6 0 0.90
7 7 0 0.25
8 8 1 0.04
9 9 0 0.33
10 10 0 0.95
> threshold
[1] 0.35
In this case, I want to match rows with group==1
with rows with group==2
where the difference between value
is smaller than threshold
(0.35). This should lead to a data frame looking like this (apologizes for potential error, did it manually).
id matched_id
1 2 3
2 2 7
3 2 9
4 4 3
5 4 6
6 4 7
7 4 9
8 5 7
9 5 9
10 8 7
11 8 9
Thank you!
UPDATED ANSWER: Was going slow on a larger dataset, so I tried to make the code a bit more efficient.
Came up with a solution that seems to do what I want. Not sure how efficient this code is on larger data but seems to work.
library(tidyverse)
library(data.table)
# All values
dist_mat <- df$value
# Adding identifier
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <-dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
dist_mat <- abs(outer(dist_mat_row, dist_mat_col, "-"))
# Identifying matches that fulfills the criteria
dist_mat <- dist_mat <= threshold
# From matrix to a long dataframe
dist_mat <- melt(dist_mat)
# Tidying up the dataframe and dropping unneccecary columns and rows.
dist_mat <- dist_mat %>%
rename(id = Var1,
matched_id = Var2,
cond = value) %>%
filter(cond == TRUE) %>%
left_join(df, by = "id") %>%
select(id, matched_id)
This leads to the following dataframe:
> arrange(dist_mat, id)
id matched_id
1 2 3
2 2 7
3 2 9
4 4 3
5 4 6
6 4 7
7 4 9
8 5 7
9 5 9
10 8 7
11 8 9