Search code examples
rdplyrmutate

Replacing with conditional value in dplyr case_when()


I have a dataset of parent offspring combinations:

key <- data.frame(parent = c("ID1", "ID2", "ID3"), infant = c("ID4", "ID5", "ID6"))

And a dataframe where the parent ID is referenced but a generic code for their infant is used.

df <- data.frame(time_point = 1:6, id = rep("ID2", 6), social1 = c(NA, NA, "inf", NA, NA, NA), social2 = c(NA, NA, NA, NA, NA, "inf"), social3 = c("inf", NA, NA, NA, NA, NA))

I would like to replace inf with appropriate code using case_when, however it seems like mutate and case_when can't reference the single value of df$id in the same row where the value inf occurs.

df %>% 
  mutate(social1 = case_when(
    grepl("inf", social1) ~ key[key$parent == id,]$infant,
    TRUE ~ social1
  ))

"Error in `mutate()`:
ℹ In argument: `social1 = case_when(...)`.
Caused by error in `case_when()`:
! Can't recycle `..1 (left)` (size 6) to match `..1 (right)` (size 2)."

I've tried a few other versions, but the problem always seems to be with referencing another column within the replacement value for case_when(). Any explanations or creative solutions are welcome.

UPDATE: I should have shown my example in its full complexity. Sometimes "inf" is in the column in combination with other IDs.

key <- data.frame(parent = c("ID1", "ID2", "ID3"), infant = c("ID4", "ID5", "ID6"))
df <- data.frame(time_point = 1:6, id = rep("ID2", 6), social1 = c(NA, NA, "inf ID7", NA, NA, NA), social2 = c(NA, NA, NA, NA, NA, "inf"), social3 = c("ID8 inf", NA, NA, NA, NA, NA))

I use grepl to identify when "inf" occurs in the string. Both Gregor and TarJae's answers to the question work with the scenario I presented, however, I encounter problems if I want to use e.g.

gsub("inf",infant, social1) as the RHS of case_when(). This is because gsub is not vectorized?

df %>% 
  left_join(., key, by = c("id" = "parent")) %>%
  mutate(social1 = case_when(
    grepl("inf", social1) ~ gsub("inf",infant, social1),
    TRUE ~ social1
  ))

Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `social1 = case_when(...)`.
Caused by warning in `gsub()`:
! argument 'replacement' has length > 1 and only the first element will be used 
 

Solution: as Gregor noted, the above code works if rowwise() is called before mutate. It's slower but it works.


Solution

  • I would start with a join and then replace the values. This should be quite efficient. The main issue with your approach is that case_when is vectorized, but your key[key$parent == id,]$infant is not. It would probably work if you did it rowwise(), but that would slow it down.

    df |>
      left_join(key, by = c("id" = "parent")) |>
      mutate(across(starts_with("social"), \(x) {
        case_when(x == "inf" ~ infant_id, .default = x)
      })) |>
      select(-infant_id)
    #   time_point  id social1 social2 social3
    # 1          1 ID2    <NA>    <NA>     ID5
    # 2          2 ID2    <NA>    <NA>    <NA>
    # 3          3 ID2     ID5    <NA>    <NA>
    # 4          4 ID2    <NA>    <NA>    <NA>
    # 5          5 ID2    <NA>    <NA>    <NA>
    # 6          6 ID2    <NA>     ID5    <NA>
    

    As a sidenote, don't use grepl for whole-string matches when == will work. It's less efficient and can also produce unexpected results if any of your strings happen to include special regex characters.


    Edit for the updated question:

    For doing replacement in strings, sub (and gsub) are vectorized over the string, but not over the replacement. Instead we can use stringr::str_replace, which is vectorized over both. And since we are pattern matching, we don't need case_when or grepl.

    df %>% 
      left_join(., key, by = c("id" = "parent")) %>%
      mutate(across(starts_with("social"), \(x) {
        str_replace(x, pattern =  "inf", replacement = infant)
      }
      ))
    #   time_point  id social1 social2 social3 infant
    # 1          1 ID2    <NA>    <NA> ID8 ID5    ID5
    # 2          2 ID2    <NA>    <NA>    <NA>    ID5
    # 3          3 ID2 ID5 ID7    <NA>    <NA>    ID5
    # 4          4 ID2    <NA>    <NA>    <NA>    ID5
    # 5          5 ID2    <NA>    <NA>    <NA>    ID5
    # 6          6 ID2    <NA>     ID5    <NA>    ID5