To preempt, I've figured out a solution to my problem using left_join()
(or merge()
), but I don't really understand why ifelse()
doesn't work for this problem. Would love to hear any other ways I could do this or improve my left_join()
use.
Basically, I'm trying to create a new column in a data frame df1
by matching the observations in a column df1$code
to the observations in a corresponding column in an indexing df index.df$code
. The new column df1$type
would be the values from index.df$type
that correspond to the df1$code
value:
#index data frame
index.df <- data.frame(
code = c("c10", "c20", "c03", "c48", "c19"),
id = c("apple", "strawberry", "pear", "banana", "blackberry"),
type = c("pome", "aggregate", "pome", "berry", "aggregate")
)
> index.df
code id type
1 c10 apple pome
2 c20 strawberry aggregate
3 c03 pear pome
4 c48 banana berry
5 c19 blackberry aggregate
#df to add col to
df1 <- data.frame(
code = c("c10", "c19", "c03", "c20", "c19", "c10", "c48", "c03", "c10", "c03"),
id = c("apple", "blackberry", "pear","strawberry", "blackberry", "apple", "banana", "pear", "apple", "pear")
)
> df1
code id
1 c10 apple
2 c19 blackberry
3 c03 pear
4 c20 strawberry
5 c19 blackberry
6 c10 apple
7 c48 banana
8 c03 pear
9 c10 apple
10 c03 pear
And this is the desired output
> df2
code id type
1 c10 apple pome
2 c19 blackberry aggregate
3 c03 pear pome
4 c20 strawberry aggregate
5 c19 blackberry aggregate
6 c10 apple pome
7 c48 banana berry
8 c03 pear pome
9 c10 apple pome
10 c03 pear pome
I tried ifelse()
this way:
df2 <- df1 %>%
mutate(df1, type = ifelse(df1$code == index.df$code, index.df$type, NA))
> df2
no code id type
1 1 c10 apple pome
2 2 c19 blackberry <NA>
3 3 c03 pear pome
4 4 c20 strawberry <NA>
5 5 c19 blackberry aggregate
6 6 c10 apple pome
7 7 c48 banana <NA>
8 8 c03 pear pome
9 9 c10 apple <NA>
10 10 c03 pear <NA>
Why is this the output? Am I using ifelse()
incorrectly?
Also, the (rather bulky) code I used to get my desired output was:
df1 <- data.frame(
no = 1:10,
code = c("c10", "c19", "c03", "c20", "c19", "c10", "c48", "c03", "c10", "c03"),
id = c("apple", "blackberry", "pear","strawberry", "blackberry", "apple", "banana", "pear", "apple", "pear")
)
df2 <- index.df %>%
left_join(df1, by = c("code", "id")) %>%
arrange(no) %>%
select(-no)
I think that a left join is a very good solution to this problem since you can have multiple merge keys. If your only merge key is code
then you can use a named vector to look up the value:
library(dplyr)
df1 |>
mutate(type = pull(index.df, type, code)[code])
The documentation ?pull
shows that using pull()
on two columns will create a named vector where the first column are the values and the second are the names:
pull(index.df, type, code)
# c10 c20 c03 c48 c19
# "pome" "aggregate" "pome" "berry" "aggregate"
Then you simply use the values from code
from df1
to index ([code]
) the correct value.
Output
code id type
1 c10 apple pome
2 c19 blackberry aggregate
3 c03 pear pome
4 c20 strawberry aggregate
5 c19 blackberry aggregate
6 c10 apple pome
7 c48 banana berry
8 c03 pear pome
9 c10 apple pome
10 c03 pear pome
Why is this the output? Am I using ifelse()
incorrectly?
Sort of, I actually think that you are using ==
"incorrectly". ifelse()
is doing what it is supposed to do, but you have a bit of a misunderstanding that @Michiel Duvekot mentions. Hopefully this adds some detail to help you understand why:
==
works element-wise==
returns (from ?==
):
A logical vector indicating the result of the element by element comparison. The elements of shorter vectors are recycled as necessary.
I think this is intuitive when you have vectors of equal length:
c(1, 1, 1) == c(1, 1, 2)
[1] TRUE TRUE FALSE
==
recyclesIt may not be so intuitive when they are not the same length. As mentioned in the documentation above, shorter vectors are recycled. length(df1$code)
is 10 and length(index.df$code)
is 5. Since the length of index.df$code
is shorter than df1$code
the values are reused over again until it is the same length. This phenomenon is known as recycling.
data.frame
in base R also recycles if the longer vector is a multiple of the shorter (here df1$code
is 2x as long as index.df$code
):
data.frame(df1_code = df1$code, indexdf_code = index.df$code)
# df1_code indexdf_code
# 1 c10 c10
# 2 c19 c20
# 3 c03 c03
# 4 c20 c48
# 5 c19 c19
# 6 c10 c10 # <- index.df$code restarts here to match length
# 7 c48 c20
# 8 c03 c03
# 9 c10 c48
# 10 c03 c19
# Does not recycle: 10 is not a multiple of 3
data.frame(df1_code = df1$code, indexdf_code = index.df$code[1:3])
# Error in data.frame(df1_code = df1$code, indexdf_code = # index.df$code[1:3]) :
# arguments imply differing number of rows: 10, 3
With these two concepts in mind you can then see the output of your logical comparison:
data.frame(df1_code = df1$code, indexdf_code = index.df$code, compare = df1$code == index.df$code)
# df1_code indexdf_code compare
# 1 c10 c10 TRUE
# 2 c19 c20 FALSE
# 3 c03 c03 TRUE
# 4 c20 c48 FALSE
# 5 c19 c19 TRUE
# 6 c10 c10 TRUE # <- index.df$code restarts here to match length
# 7 c48 c20 FALSE
# 8 c03 c03 TRUE
# 9 c10 c48 FALSE
# 10 c03 c19 FALSE
ifelse()
is simply returning the recycled value of type
where your logical expression evaluates to TRUE
and NA
otherwise. By pure chance due to recycling row 8 happens to align.
In fact, recycling multiples of length > 1 generates an error when you use tidyverse packages.