Search code examples
rdplyrconditional-statementssample

R: conditionally mutate a variable when columns match in different dataframes


I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.

In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.

Some reproducible code:

library(dplyr)

df1 = data.frame(ids = c(1:5),
                 var = c("a","a","c","b","b"))

df2 = data.frame(var = c('c','a','b','b','d'),
                 links = 0)

Ideally, I would like a resulting dataframe that looks like:

  var    links
1   c        3
2   a   1 or 2
3   b   4 or 5
4   b   4 or 5
5   d        0

where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".

Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".

Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.

So far, I've gone this route, but I'm unsure about what to put after the ~

linked_df = df2 %>%
     mutate(links=case_when(links==0 & var %in% df1$var ~ 
                            sample(c(df1$ids),n(),replace=T) # unsure about this line
                            TRUE ~ links)

Solution

  • I think this is what you want. I've left the ids column in the result, but it can be removed when the sampling is complete.

    library(dplyr)
    library(tidyr)
    
    df1_nest = df1 %>%
      group_by(var) %>% 
      summarize(ids = list(ids))
    
    safe_sample = function(x, ...) {
      if(length(x) == 1) return(x)
      sample(x, ...)
    }
    
    set.seed(47)
    df2 %>%
      left_join(df1_nest) %>%
      mutate(
        links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
      )
    # Joining, by = "var"
    #   var links  ids
    # 1   c     3    3
    # 2   a     1 1, 2
    # 3   b     4 4, 5
    # 4   b     5 4, 5
    # 5   d     0 NULL