I am trying to merge two data frames. df1 contains company-level data with the columns company_name and industry_code. Say df1 looks like this:
company_name | industry_code |
---|---|
A | 33 |
B | 43 |
C | 56 |
D | 88 |
df2 contains industry_name, min_code, and max_code. Say df2 looks like this:
industry_name | min_code | max_code |
---|---|---|
Smoke | 10 | 19 |
Food | 20 | 39 |
Care | 40 | 69 |
Manufacturing | 70 | 99 |
I want to merge the df2 industry data into the df1 company data so that, post-merge, the data looks like this:
company_name | industry_code | industry_name |
---|---|---|
A | 33 | Food |
B | 43 | Care |
C | 56 | Care |
D | 88 | Manufacturing |
Please see below my failed attempts:
# first attempt
df1 %>%
  left_join(df2, by = c("industry_code" = "min_code"))
# second attempt
df1 %>%
  mutate(industry_name = ifelse(between(industry_code, df2$min_code, df2$max_code), df2$industry_name, NA))
As Jon has already said, the simplest answer using dplyr is to use join_by() (available since dplyr 1.1.0):
df1 |>
  left_join(df2, join_by(between(industry_code, min_code, max_code))) |>
  select(-min_code, -max_code)
Output:
# A tibble: 4 × 3
company_name industry_code industry_name
<chr> <dbl> <chr>
1 A 33 Food
2 B 43 Care
3 C 56 Care
4 D 88 Manufacturing
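If you want to run this yourself, here is a self-contained version. The data frames are rebuilt from the tables in the question; the object name result is just something I picked for illustration.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0; |> needs R >= 4.1

# Sample data copied from the tables in the question
df1 <- tibble(
  company_name  = c("A", "B", "C", "D"),
  industry_code = c(33, 43, 56, 88)
)
df2 <- tibble(
  industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
  min_code      = c(10, 20, 40, 70),
  max_code      = c(19, 39, 69, 99)
)

# Range join: keep every company and attach the industry whose
# [min_code, max_code] interval contains its industry_code
result <- df1 |>
  left_join(df2, join_by(between(industry_code, min_code, max_code))) |>
  select(-min_code, -max_code)

result
```

Because this is a left join, a company whose code falls outside every range would be kept, with NA for industry_name.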
The issue with the first attempt (as I'm sure you're already aware) is that the industry codes don't match the min codes exactly, so you get NAs.
The issue with the second attempt can be made clearer if you play around with it a bit:
between(c(88, 88, 88, 88), df2$min_code, df2$max_code) # [1] FALSE FALSE FALSE TRUE
between(c(33, 33, 33, 33), df2$min_code, df2$max_code) # [1] FALSE TRUE FALSE FALSE
between(c(33, 33, 33, 33, 33), df2$min_code, df2$max_code)
# Error in between(c(33, 33, 33, 33, 33), df2$min_code, df2$max_code) :
# Incompatible vector lengths: length(x)==5 length(lower)==4 length(upper)==4. Each should be either length 1 or the length of the longest.
The code isn't checking each industry code against every min and max value; it's checking the first industry code against the first min and the first max, the second against the second, and so on.
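To make the difference concrete, here is a rough sketch in base R contrasting the element-wise comparison with an all-pairs one. I've used plain >= and <= instead of between() so it doesn't depend on which package's between() is loaded, and the vectors are just the columns from the question copied out by hand.

```r
industry_code <- c(33, 43, 56, 88)
min_code <- c(10, 20, 40, 70)
max_code <- c(19, 39, 69, 99)

# Element-wise (what the second attempt effectively does):
# code i is compared with range i only
elementwise <- industry_code >= min_code & industry_code <= max_code

# All pairs: rows are companies, columns are industries;
# entry [i, j] is TRUE when code i falls inside range j
all_pairs <- outer(industry_code, min_code, ">=") &
  outer(industry_code, max_code, "<=")

# Column index of the matching range for each company
match_idx <- apply(all_pairs, 1, which)
```

With the sample data, the element-wise version is only TRUE for the last two rows, and those are right purely because C and D happen to sit in the same positions as their ranges. The all-pairs matrix is what the range join works out for you.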
If you still wanted to use this approach, looping over industry_code with map_chr() works. Also, there are no gaps in your sample data, so if that's also the case with your real data, you could potentially use base R's findInterval(), e.g. df2$industry_name[findInterval(df1$industry_code, df2$min_code)].
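Here is roughly how those two alternatives might look. This is a sketch rather than a definitive implementation: lookup_industry() is a helper name I made up, and the findInterval() version assumes min_code is sorted in increasing order and that the ranges have no gaps, since it never looks at max_code.

```r
library(purrr)

df1 <- data.frame(
  company_name  = c("A", "B", "C", "D"),
  industry_code = c(33, 43, 56, 88),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  industry_name = c("Smoke", "Food", "Care", "Manufacturing"),
  min_code      = c(10, 20, 40, 70),
  max_code      = c(19, 39, 69, 99),
  stringsAsFactors = FALSE
)

# Hypothetical helper: check one code against every range and
# return the matching industry name, or NA if there is not exactly one
lookup_industry <- function(code) {
  hit <- df2$industry_name[code >= df2$min_code & code <= df2$max_code]
  if (length(hit) == 1) hit else NA_character_
}

# map_chr() calls the helper once per industry code
df1$industry_name <- map_chr(df1$industry_code, lookup_industry)

# findInterval() returns, for each code, the index of the last
# min_code that is <= the code (requires min_code to be sorted)
via_interval <- df2$industry_name[findInterval(df1$industry_code, df2$min_code)]
```

One thing to watch with findInterval(): a code below the smallest min_code gets index 0, and indexing with 0 silently drops that element, so the result can end up shorter than df1.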