Search code examples
rselectinner-jointapply

In two data.frames in R, find values in one data.frame, based on multiple variables in the other data.frame


The previous question was closed, but the easy solution doesn't seem to work. So I further explained my question here.

I have two dataframes, df1 and df2. df1 has a lot of raw data, df2 has pointers based on "value_a" where to look in the raw data.

df1 <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88), "value_b" = c(100, 101, 100, 95, 90, 85, 88, 84, 75, 110, 105, 106, 104, 95, 98, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97), "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0), "value_d" = c(1:27))
df2 <- data.frame("id" = c(1, 2, 3), "value_a" = c(53, 45, 47))

I would like to use the values in df2, to search in df1. So, for every "id", and it's unique "value_a" given in df2, find the corresponding "value_b" and "value_c" in df1, so I can generate a df3 which looks like this:

df3 <- data.frame("id" = c(1, 2, 3), "value_a" = c(53, 45, 47), "value_b" = c(85, 95, 101), "value_c" = c(-2, 1, 2))

Obviously, I have hundreds of "id"s to cover. Since I want to find multiple variables ("value_b", "value_c" but not "value_d") pull() won't work since it only pulls one variable. Based on this page I started thinking of joining. An innerjoin() wont work either, because I have to select on multiple variables (id & value_a). Merging like this

df3 <- merge(x = df1, y = df2, by.x = c(id, value_a), by.y = c(id, value_a)) %>%
  select(id, value_a, value_b, value_c)

is propably what describes what I'm thinking of, but this throws an error: Error in fix.by(by.x, x) : object 'value_a' not found

I was also thinking of using tapply() but I get stuck on using two different data.frames. Does someone have a good idea on how to tackle this?

Best regards,

Johan


Solution

  • I believe this will be useful:

    df2 %>% 
      inner_join(df1, by = c("id"="id", "value_a"="value_a"))
    

    Output:

      id value_a value_b value_c value_d
    1  1      53      85      -2       6
    2  2      45      95       1      14
    3  3      47     101       2      24