Perform left_join between df1 and df2 where df2 gives a range for the joining variable in df1

I am trying to merge two data frames.

df1 contains company level data with columns company_name and industry_code. Say df1 looks like this:

company_name	industry_code
A	33
B	43
C	56
D	88

df2 contains industry_name, min_code, max_code. Say df2 looks like this:

industry_name	min_code	max_code
Smoke	10	19
Food	20	39
Care	40	69
Manufacturing	70	99

I want to merge the df2 industry data into the df1 company data so that, post-merge, the data looks like this:

company_name	industry_code	industry_name
A	33	Food
B	43	Care
C	56	Care
D	88	Manufacturing

Please see below my failed attempts:

# first attempt
df1 %>%
left_join(df2, by = c("industry_code" = "min_code"))

# second attempt
df1 %>%
mutate(industry_name = ifelse(between(industry_code, df2$min_code, df2$max_code), df2$industry_name, NA))

Solution

As Jon has already said, the simplest answer using dplyr is to use join_by() (available in dplyr 1.1.0 and later):

df1 |>
  left_join(df2, join_by(between(industry_code, min_code, max_code))) |>
  select(-min_code, -max_code)

Output:

# A tibble: 4 × 3
  company_name industry_code industry_name
  <chr>                <dbl> <chr>        
1 A                       33 Food         
2 B                       43 Care         
3 C                       56 Care         
4 D                       88 Manufacturing

The issue with the first attempt is, as I'm sure you're already aware, that the industry codes don't match the min codes, so you get NAs.

The issue with the second attempt can be made clearer if you play around with it a bit:

between(c(88, 88, 88, 88), df2$min_code, df2$max_code) # [1] FALSE FALSE FALSE  TRUE

between(c(33, 33, 33, 33), df2$min_code, df2$max_code) # [1] FALSE  TRUE FALSE FALSE

between(c(33, 33, 33, 33, 33), df2$min_code, df2$max_code) 
# Error in between(c(33, 33, 33, 33, 33), df2$min_code, df2$max_code) : 
#  Incompatible vector lengths: length(x)==5 length(lower)==4 length(upper)==4. Each should be either length 1 or the length of the longest.

The code isn't checking each industry code against every min and max value. It's checking the first industry code against the first min and the first max, the second against the second, and so on.
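To make the positional comparison concrete, here is a rough base R sketch of what the second attempt effectively computes on the sample data. It uses plain >= and <= rather than between(), so it doesn't depend on which package's between() is loaded; the names pairwise and second_attempt are just for illustration.

```r
# Sample data from the question
df1 <- data.frame(
  company_name  = c("A", "B", "C", "D"),
  industry_code = c(33, 43, 56, 88)
)
df2 <- data.frame(
  industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
  min_code      = c(10, 20, 40, 70),
  max_code      = c(19, 39, 69, 99)
)

# Element i of industry_code is only compared with row i of df2
pairwise <- df1$industry_code >= df2$min_code &
  df1$industry_code <= df2$max_code
pairwise
# [1] FALSE FALSE  TRUE  TRUE

# ifelse() then picks industry_name from the same position, or NA
second_attempt <- ifelse(pairwise, df2$industry_name, NA)
second_attempt
# [1] NA              NA              "Care"          "Manufacturing"
```

Companies C and D only get a name because their codes happen to fall inside the range on the same row number of df2, not because any real lookup took place.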

If you still wanted to use this approach, looping over industry_code with purrr's map_chr() works. Also, there are no gaps in your sample data, so if that's also the case with your real data, you could potentially use base R's findInterval(), e.g. df2$industry_name[findInterval(df1$industry_code, df2$min_code)]. Note that findInterval() relies on min_code being sorted in increasing order.
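Here is a minimal sketch of both alternatives on the sample data. The helper lookup_industry() and the column names name_loop and name_interval are hypothetical, made up for this example. The findInterval() version assumes the ranges are contiguous and min_code is sorted, and it would also label any code above the last max_code as the last industry.

```r
library(purrr)

df1 <- data.frame(
  company_name  = c("A", "B", "C", "D"),
  industry_code = c(33, 43, 56, 88)
)
df2 <- data.frame(
  industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
  min_code      = c(10, 20, 40, 70),
  max_code      = c(19, 39, 69, 99)
)

# Option 1: check one code at a time against every range in df2.
# lookup_industry() is a hypothetical helper, not part of any package.
lookup_industry <- function(code) {
  hit <- df2$industry_name[code >= df2$min_code & code <= df2$max_code]
  # Return NA when zero or several ranges match
  if (length(hit) == 1) hit else NA_character_
}
df1$name_loop <- map_chr(df1$industry_code, lookup_industry)

# Option 2: findInterval() returns the index of the last min_code that is
# <= each code. Requires min_code sorted ascending and no gaps in the ranges.
idx <- findInterval(df1$industry_code, df2$min_code)
idx[idx == 0] <- NA  # codes below the first min_code have no industry
df1$name_interval <- df2$industry_name[idx]

df1
```

On the sample data both columns should agree with the join_by() result. The loop is the safer choice if your real ranges have gaps, since it checks max_code as well.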