Search code examples
rdplyr

why doesn't using ifelse() to create a new column based on matching observations in two dfs work for this data?


To preempt, I've figured out a solution to my problem using left_join() (or merge()), but I don't really understand why ifelse() doesn't work for this problem. Would love to hear any other ways I could do this or improve my left_join() use.

Basically, I'm trying to create a new column in a data frame df1 by matching the observations in a column df1$code to the observations in a corresponding column in an indexing df index.df$code. The new column df1$type would be the values from index.df$type that correspond to the df1$code value:

#index data frame
index.df <- data.frame(
  code = c("c10", "c20", "c03", "c48", "c19"),
  id = c("apple", "strawberry", "pear", "banana", "blackberry"),
  type = c("pome", "aggregate", "pome", "berry", "aggregate")
)
> index.df
  code         id      type
1  c10      apple      pome
2  c20 strawberry aggregate
3  c03       pear      pome
4  c48     banana     berry
5  c19  blackberry aggregate

#df to add col to
df1 <- data.frame(
  code = c("c10", "c19", "c03", "c20", "c19", "c10", "c48", "c03", "c10", "c03"),
  id = c("apple", "blackberry", "pear","strawberry", "blackberry", "apple", "banana", "pear", "apple", "pear")
)
> df1
   code         id
1   c10      apple
2   c19 blackberry
3   c03       pear
4   c20 strawberry
5   c19 blackberry
6   c10      apple
7   c48     banana
8   c03       pear
9   c10      apple
10  c03       pear

And this is the desired output

> df2
   code         id      type
1   c10      apple      pome
2   c19 blackberry aggregate
3   c03       pear      pome
4   c20 strawberry aggregate
5   c19 blackberry aggregate
6   c10      apple      pome
7   c48     banana     berry
8   c03       pear      pome
9   c10      apple      pome
10  c03       pear      pome

I tried ifelse() this way:

df2 <- df1 %>%
  mutate(df1, type = ifelse(df1$code == index.df$code, index.df$type, NA))

> df2
   no code         id      type
1   1  c10      apple      pome
2   2  c19 blackberry      <NA>
3   3  c03       pear      pome
4   4  c20 strawberry      <NA>
5   5  c19 blackberry aggregate
6   6  c10      apple      pome
7   7  c48     banana      <NA>
8   8  c03       pear      pome
9   9  c10      apple      <NA>
10 10  c03       pear      <NA>

Why is this the output? Am I using ifelse() incorrectly?

Also, the (rather bulky) code I used to get my desired output was:

df1 <- data.frame(
  no = 1:10,
  code = c("c10", "c19", "c03", "c20", "c19", "c10", "c48", "c03", "c10", "c03"),
  id = c("apple", "blackberry", "pear","strawberry", "blackberry", "apple", "banana", "pear", "apple", "pear")
)

df2 <- index.df %>%
  left_join(df1, by = c("code", "id")) %>%
  arrange(no) %>%
  select(-no)

Solution

  • I think that a left join is a very good solution to this problem since you can have multiple merge keys. If your only merge key is code then you can use a named vector to look up the value:

    library(dplyr)
    
    df1 |>
      mutate(type = pull(index.df, type, code)[code])
    

    The documentation ?pull shows that using pull() on two columns will create a named vector where the first column are the values and the second are the names:

    pull(index.df, type, code)
    #         c10         c20         c03         c48         c19 
    #      "pome" "aggregate"      "pome"     "berry" "aggregate" 
    

    Then you simply use the values from code from df1 to index ([code]) the correct value.

    Output

       code         id      type
    1   c10      apple      pome
    2   c19 blackberry aggregate
    3   c03       pear      pome
    4   c20 strawberry aggregate
    5   c19 blackberry aggregate
    6   c10      apple      pome
    7   c48     banana     berry
    8   c03       pear      pome
    9   c10      apple      pome
    10  c03       pear      pome
    

    Why is this the output? Am I using ifelse() incorrectly?

    Sort of, I actually think that you are using == "incorrectly". ifelse() is doing what it is supposed to do, but you have a bit of a misunderstanding that @Michiel Duvekot mentions. Hopefully this adds some detail to help you understand why:

    1. == works element-wise

    == returns (from ?==):

    A logical vector indicating the result of the element by element comparison. The elements of shorter vectors are recycled as necessary.

    I think this is intuitive when you have vectors of equal length:

    c(1, 1, 1) == c(1, 1, 2)
    [1]  TRUE  TRUE FALSE
    
    1. == recycles

    It may not be so intuitive when they are not the same length. As mentioned in the documentation above, shorter vectors are recycled. length(df1$code) is 10 and length(index.df$code) is 5. Since the length of index.df$code is shorter than df1$code the values are reused over again until it is the same length. This phenomenon is known as recycling.

    data.frame in base R also recycles if the longer vector is a multiple of the shorter (here df1$code is 2x as long as index.df$code):

    data.frame(df1_code = df1$code, indexdf_code = index.df$code)
    #    df1_code indexdf_code
    # 1       c10          c10
    # 2       c19          c20
    # 3       c03          c03
    # 4       c20          c48
    # 5       c19          c19
    # 6       c10          c10 # <- index.df$code restarts here to match length
    # 7       c48          c20
    # 8       c03          c03
    # 9       c10          c48
    # 10      c03          c19
    
    # Does not recycle: 10 is not a multiple of 3
    data.frame(df1_code = df1$code, indexdf_code = index.df$code[1:3])
    # Error in data.frame(df1_code = df1$code, indexdf_code = # index.df$code[1:3]) : 
    #   arguments imply differing number of rows: 10, 3
    

    With these two concepts in mind you can then see the output of your logical comparison:

    data.frame(df1_code = df1$code, indexdf_code = index.df$code, compare = df1$code == index.df$code)
    #    df1_code indexdf_code compare
    # 1       c10          c10    TRUE
    # 2       c19          c20   FALSE
    # 3       c03          c03    TRUE
    # 4       c20          c48   FALSE
    # 5       c19          c19    TRUE
    # 6       c10          c10    TRUE # <- index.df$code restarts here to match length
    # 7       c48          c20   FALSE
    # 8       c03          c03    TRUE
    # 9       c10          c48   FALSE
    # 10      c03          c19   FALSE
    

    ifelse() is simply returning the recycled value of type where your logical expression evaluates to TRUE and NA otherwise. By pure chance due to recycling row 8 happens to align.

    In fact, recycling multiples of length > 1 generates an error when you use tidyverse packages.