r join dplyr merge left-join

Perform left_join between df1 and df2 where df2 gives a range for the joining variable in df1


I am trying to merge two data frames.

df1 contains company level data with columns company_name and industry_code. Say df1 looks like this:

company_name industry_code
A 33
B 43
C 56
D 88

df2 contains industry_name, min_code, max_code. Say df2 looks like this:

industry_name min_code max_code
Smoke 10 19
Food 20 39
Care 40 69
Manufacturing 70 99

I want to merge the df2 industry data to df1 company data. Such that, post-merge, the data looks like this:

company_name industry_code industry_name
A 33 Food
B 43 Care
C 56 Care
D 88 Manufacturing

Please see below my failed attempts:

# first attempt
df1 %>%
  left_join(df2, by = c("industry_code" = "min_code"))

# second attempt
df1 %>%
  mutate(industry_name = ifelse(between(industry_code, df2$min_code, df2$max_code), df2$industry_name, NA))




Solution

  • As Jon has already said, the simplest answer using dplyr is to use join_by():

    df1 |>
      left_join(df2, join_by(between(industry_code, min_code, max_code))) |>
      select(-min_code, -max_code)
    

    Output:

    # A tibble: 4 × 3
      company_name industry_code industry_name
      <chr>                <dbl> <chr>        
    1 A                       33 Food         
    2 B                       43 Care         
    3 C                       56 Care         
    4 D                       88 Manufacturing
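
    For a reproducible version, here is a minimal sketch that rebuilds the sample data from the question. The tibble() construction is my assumption about how df1 and df2 are stored and is not from the original post. Note that join_by() needs dplyr 1.1.0 or later, and the native pipe needs R 4.1 or later.

    ```r
    library(dplyr)  # join_by() requires dplyr >= 1.1.0

    # Sample data copied from the question (construction code is assumed)
    df1 <- tibble(
      company_name  = c("A", "B", "C", "D"),
      industry_code = c(33, 43, 56, 88)
    )
    df2 <- tibble(
      industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
      min_code      = c(10, 20, 40, 70),
      max_code      = c(19, 39, 69, 99)
    )

    # Range join: keep every row of df1, matching where
    # min_code <= industry_code <= max_code
    result <- df1 |>
      left_join(df2, join_by(between(industry_code, min_code, max_code))) |>
      select(-min_code, -max_code)

    result
    ```

    Because this is a left join, a company whose code falls outside every range would still be kept, with NA as its industry_name.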
    

    The issue with the first attempt (as I'm sure you're already aware) is that the industry codes don't exactly match any of the min codes, so you get NAs.

    The issue with the second attempt can be made clearer if you play around with it a bit:

    between(c(88, 88, 88, 88), df2$min_code, df2$max_code) # [1] FALSE FALSE FALSE  TRUE
    
    between(c(33, 33, 33, 33), df2$min_code, df2$max_code) # [1] FALSE  TRUE FALSE FALSE
    
    between(c(33, 33, 33, 33, 33), df2$min_code, df2$max_code) 
    # Error in between(c(33, 33, 33, 33, 33), df2$min_code, df2$max_code) : 
    #  Incompatible vector lengths: length(x)==5 length(lower)==4 length(upper)==4. Each should be either length 1 or the length of the longest.
    

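    The code isn't checking each industry code against every min and max value; it's checking the first industry code against the first min and the first max, the second against the second, and so on. With your data, 33 is only compared with the Smoke range (10 to 19), so company A gets no match.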
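
    To make the difference concrete, here is a small base R sketch that contrasts the element-wise comparison with an all-pairs comparison built with outer(). The data frames are rebuilt from the question, and the variable names elementwise and all_pairs are just illustrative.

    ```r
    # Sample data copied from the question (construction code is assumed)
    df1 <- data.frame(
      company_name  = c("A", "B", "C", "D"),
      industry_code = c(33, 43, 56, 88)
    )
    df2 <- data.frame(
      industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
      min_code      = c(10, 20, 40, 70),
      max_code      = c(19, 39, 69, 99)
    )

    # Element-wise: code i is only compared with range i
    elementwise <- df1$industry_code >= df2$min_code &
      df1$industry_code <= df2$max_code

    # All pairs: rows are companies, columns are industries
    all_pairs <- outer(df1$industry_code, df2$min_code, ">=") &
      outer(df1$industry_code, df2$max_code, "<=")
    dimnames(all_pairs) <- list(df1$company_name, df2$industry_name)

    elementwise
    all_pairs
    ```

    Only the all-pairs matrix has exactly one TRUE per company, which is the comparison the range join performs.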

    If you still wanted to use this approach, looping over industry_code with map_chr() to look up each industry_name works. Also, there are no gaps in your sample data, so if that's also the case with your real data, you could potentially use base R's findInterval(), e.g. df2$industry_name[findInterval(df1$industry_code, df2$min_code)]. This relies on df2 being sorted by min_code.
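
    Both alternatives might look roughly like the sketch below. The helper lookup_industry() is a hypothetical name, not something from the original answer, and the guard around findInterval() is an extra I added so that codes below the first min_code or inside a gap come back as NA instead of being silently dropped or misassigned.

    ```r
    library(purrr)

    # Sample data copied from the question (construction code is assumed)
    df1 <- data.frame(
      company_name  = c("A", "B", "C", "D"),
      industry_code = c(33, 43, 56, 88)
    )
    df2 <- data.frame(
      industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
      min_code      = c(10, 20, 40, 70),
      max_code      = c(19, 39, 69, 99),
      stringsAsFactors = FALSE
    )

    # Option 1: check one code at a time against every range.
    # lookup_industry() is a hypothetical helper for illustration.
    lookup_industry <- function(code) {
      hit <- df2$industry_name[code >= df2$min_code & code <= df2$max_code]
      if (length(hit) == 1) hit else NA_character_
    }
    via_map <- map_chr(df1$industry_code, lookup_industry)

    # Option 2: findInterval(), which needs df2 sorted by min_code.
    idx <- findInterval(df1$industry_code, df2$min_code)
    idx[idx == 0] <- NA  # code is below the smallest min_code
    in_range <- df1$industry_code <= df2$max_code[idx]
    via_interval <- ifelse(in_range, df2$industry_name[idx], NA_character_)

    via_map
    via_interval
    ```

    For large data frames the findInterval() version avoids the per-row loop, at the cost of assuming sorted, non-overlapping ranges.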