r if-statement data-cleaning purrr mutate

What gets passed to the mutate and modify?

I'm fairly new to R but not new to programming itself. I am using a simplified example of my code here. I have a dataframe with three columns (doc_id, tag_list, single_tag), all of which are character columns.

df <- data.frame('doc_id' = c('A', 'B', 'C', 'D'),
                 'tag_list' = c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4"),
                 'single_tag' = c("tagA2", NA, "tagC", NA)
                 )

Here is what I've been doing: If the value of single_tag is NA, I try to replace it with the value in tag_list.

df %>% mutate(single_tag = ifelse(is.na(single_tag), tag_list, single_tag))

This works as expected with the following output

  doc_id          tag_list        single_tag
1      A tagA1,tagA2,tagA3             tagA2
2      B       tagB1,tabB2       tagB1,tabB2
3      C      tagC3, tagC4              tagC
4      D tagD1,tagD3,tagD4 tagD1,tagD3,tagD4

Now I want to do the same thing again, but this time, if single_tag is NA, I would like to replace it with the first value in tag_list (expected output below). Here's the code I tried.

df %>% mutate(single_tag = ifelse(is.na(single_tag), str_split(tag_list, ",")[[1]][1], single_tag))

Expected output (** added for emphasis) :

  doc_id          tag_list single_tag
1      A tagA1,tagA2,tagA3      tagA2
2      B       tagB1,tabB2      **tagB1**
3      C      tagC3, tagC4       tagC
4      D tagD1,tagD3,tagD4      **tagD1**

Actual output (** added for emphasis):

  doc_id          tag_list single_tag
1      A tagA1,tagA2,tagA3      tagA2
2      B       tagB1,tabB2      **tagA1**
3      C      tagC3, tagC4       tagC
4      D tagD1,tagD3,tagD4      **tagA1**

I also tried this with modify_if

df <- df %>% mutate(single_tag = modify_if(.,is.na(single_tag), ~ str_split(tag_list, ",")[[1]][1], .else=single_tag))

I get the following error:

Error in `mutate()`:
ℹ In argument: `single_tag = modify_if(...)`.
Caused by error in `where_if()`:
! length(.p) == length(.x) is not TRUE

I did some digging and found that the length of .x is 3 and of the predicate .p is 4. I have discovered that .p produces a vector of four logical values one for each row in df. .x I presume is only getting the values of the three columns in one row.

While I know some ways to achieve what I need, I want to understand what is going on in these two cases. I feel like I'm using a traditional way of thinking about how functions and arguments work, but somehow it's different in this case (because of vectorisation, perhaps?). I tried reading the documentation and the code, but I am stumped.

I'm on R version 4.2.3 if that matters.

Any help would be appreciated!

Solution

Going through your examples in order:

library(tidyverse)

df %>% mutate(single_tag = ifelse(is.na(single_tag), str_split(tag_list, ",")[[1]][1], single_tag))

With this, it's instructive to look at the output of str_split(tag_list, ","):

str_split(df$tag_list, ",")
[[1]]
[1] "tagA1" "tagA2" "tagA3"

[[2]]
[1] "tagB1" "tabB2"

[[3]]
[1] "tagC3"  " tagC4"

[[4]]
[1] "tagD1" "tagD3" "tagD4"

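As you can see, str_split() is vectorised: it returns a list with one element per row of the dataframe. The [[1]][1] index then picks the first tag of the first row only ("tagA1"), and ifelse() recycles that single value into every row where single_tag is NA, hence your result.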
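To make the recycling visible, here is a minimal sketch on plain vectors. It assumes base R's strsplit() as a stand-in for str_split() so that it runs without any packages; the variable names are only for illustration.

```r
# Same data as in the question, as plain vectors
tag_list   <- c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4")
single_tag <- c("tagA2", NA, "tagC", NA)

# strsplit() (base R stand-in for str_split()) returns one list element per row
pieces <- strsplit(tag_list, ",")

# [[1]][1] collapses that list to a single string taken from row 1
first_of_row1 <- pieces[[1]][1]

# ifelse() recycles the length-1 `yes` value into every NA position
recycled <- ifelse(is.na(single_tag), first_of_row1, single_tag)

# Taking the first piece of *each* row gives a `yes` vector with one value per row
first_of_each <- vapply(pieces, function(p) p[1], character(1))
per_row <- ifelse(is.na(single_tag), first_of_each, single_tag)

print(recycled)
print(per_row)
```

The only difference between the two calls is the length of the `yes` argument: one value shared by all rows versus one value per row.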

df <- df %>% mutate(single_tag = modify_if(.,is.na(single_tag), tag_list, .else=single_tag))

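The issue with this is that .x (the first argument of modify_if()) is, per the documentation, meant to be a vector or list, but inside the pipe . is the whole dataframe, not a single row. A dataframe is a list of its columns, so length(.x) is 3 (the number of columns), while the predicate is.na(single_tag) has one value per row, so length(.p) is 4. That mismatch is what the error reports.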
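The two lengths in the error message can be checked directly. This is a small base R sketch using the dataframe from the question; n_x and n_p are illustrative names, not anything purrr defines.

```r
df <- data.frame(
  doc_id     = c("A", "B", "C", "D"),
  tag_list   = c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4"),
  single_tag = c("tagA2", NA, "tagC", NA)
)

# A dataframe is a list of columns, so its length is the number of columns
n_x <- length(df)

# The predicate is built from one column, so it has one value per row
n_p <- length(is.na(df$single_tag))

print(c(n_x = n_x, n_p = n_p))
```

Here n_x counts columns and n_p counts rows, which is why the check length(.p) == length(.x) fails.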

Solutions

Update: new solution by Ritchie Sacramento: use str_split_i() (available in stringr 1.5.0 and later):

df |> mutate(single_tag = ifelse(is.na(single_tag), str_split_i(tag_list, ",", 1), single_tag))

Original:

Use str_extract() to get everything before the first comma. In the pattern, ^ is the start of the string, . is any character, * means match it any number of times, ? makes the match non-greedy (i.e. it stops as early as it can instead of running to the last comma), and (?=,) is a look-ahead for a comma.

df |> mutate(single_tag = ifelse(is.na(single_tag), str_extract(tag_list, "^.*?(?=,)"), single_tag))
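One caveat worth checking: the look-ahead pattern needs a comma to match, so a tag_list that holds a single tag would give NA. A minimal base R sketch, assuming you would rather keep the whole string in that case, is to delete everything from the first comma onward with sub(). Note that first_tag is a hypothetical helper name and "solo" is made-up test input; neither comes from stringr or from the question.

```r
# Hypothetical helper: text before the first comma,
# or the whole string when there is no comma at all
first_tag <- function(x) sub(",.*$", "", x)

tags <- c("tagA1,tagA2,tagA3", "tagB1,tabB2", "solo")
result <- first_tag(tags)

print(result)
```

Inside mutate() this would slot into the same place as the str_extract() call.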

Split the tag_list column into an actual list column, then take the first element of each entry (using map_chr()):

df |> mutate(tag_list = str_split(tag_list, ","),
             single_tag = ifelse(is.na(single_tag), map_chr(tag_list, 1), single_tag))

Use map2_chr() to work on one row at a time, taking the first piece of the split explicitly:

df |> mutate(single_tag = map2_chr(tag_list, single_tag, \(t, s) ifelse(is.na(s), str_split(t, ",")[[1]][1], s)))