I have a dataset of parent offspring combinations:
key <- data.frame(parent = c("ID1", "ID2", "ID3"), infant = c("ID4", "ID5", "ID6"))
And a dataframe where the parent ID is referenced but a generic code for their infant is used.
df <- data.frame(time_point = 1:6, id = rep("ID2", 6), social1 = c(NA, NA, "inf", NA, NA, NA), social2 = c(NA, NA, NA, NA, NA, "inf"), social3 = c("inf", NA, NA, NA, NA, NA))
I would like to replace inf
with appropriate code using case_when
, however it seems like mutate and case_when can't reference the single value of df$id
in the same row where the value inf
occurs.
df %>%
mutate(social1 = case_when(
grepl("inf", social1) ~ key[key$parent == id,]$infant,
TRUE ~ social1
))
"Error in `mutate()`:
ℹ In argument: `social1 = case_when(...)`.
Caused by error in `case_when()`:
! Can't recycle `..1 (left)` (size 6) to match `..1 (right)` (size 2)."
I've tried a few other versions, but the problem always seems to be with referencing another column within the replacement value for case_when()
. Any explanations or creative solutions are welcome.
UPDATE: I should have shown my example in its full complexity. Sometimes "inf" is in the column in combination with other IDs.
key <- data.frame(parent = c("ID1", "ID2", "ID3"), infant = c("ID4", "ID5", "ID6"))
df <- data.frame(time_point = 1:6, id = rep("ID2", 6), social1 = c(NA, NA, "inf ID7", NA, NA, NA), social2 = c(NA, NA, NA, NA, NA, "inf"), social3 = c("ID8 inf", NA, NA, NA, NA, NA))
I use grepl to identify when "inf" occurs in the string. Both Gregor and TarJae's answers to the question work with the scenario I presented, however, I encounter problems if I want to use e.g.
gsub("inf",infant, social1)
as the RHS of case_when(). This is because gsub is not vectorized?
df %>%
left_join(., key, by = c("id" = "parent")) %>%
mutate(social1 = case_when(
grepl("inf", social1) ~ gsub("inf",infant, social1),
TRUE ~ social1
))
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `social1 = case_when(...)`.
Caused by warning in `gsub()`:
! argument 'replacement' has length > 1 and only the first element will be used
Solution: as Gregor noted, the above code works if rowwise()
is called before mutate. It's slower but it works.
I would start with a join and then replace the values. This should be quite efficient. The main issue with your approach is that case_when
is vectorized, but your key[key$parent == id,]$infant
is not. It would probably work if you did it rowwise()
, but that would slow it down.
df |>
left_join(key, by = c("id" = "parent")) |>
mutate(across(starts_with("social"), \(x) {
case_when(x == "inf" ~ infant_id, .default = x)
})) |>
select(-infant_id)
# time_point id social1 social2 social3
# 1 1 ID2 <NA> <NA> ID5
# 2 2 ID2 <NA> <NA> <NA>
# 3 3 ID2 ID5 <NA> <NA>
# 4 4 ID2 <NA> <NA> <NA>
# 5 5 ID2 <NA> <NA> <NA>
# 6 6 ID2 <NA> ID5 <NA>
As a sidenote, don't use grepl
for whole-string matches when ==
will work. It's less efficient and can also produce unexpected results if any of your strings happen to include special regex characters.
Edit for the updated question:
For doing replacement in strings, sub
(and gsub
) are vectorized over the string, but not over the replacement. Instead we can use stringr::str_replace
, which is vectorized over both. And since we are pattern matching, we don't need case_when
or grepl
.
df %>%
left_join(., key, by = c("id" = "parent")) %>%
mutate(across(starts_with("social"), \(x) {
str_replace(x, pattern = "inf", replacement = infant)
}
))
# time_point id social1 social2 social3 infant
# 1 1 ID2 <NA> <NA> ID8 ID5 ID5
# 2 2 ID2 <NA> <NA> <NA> ID5
# 3 3 ID2 ID5 ID7 <NA> <NA> ID5
# 4 4 ID2 <NA> <NA> <NA> ID5
# 5 5 ID2 <NA> <NA> <NA> ID5
# 6 6 ID2 <NA> ID5 <NA> ID5