I'm fairly new to R but not new to programming itself. I am using a simplified example of my code here. I have a dataframe that has three columns (doc_id, tag_list, single_tag), all of which are character columns.
df <- data.frame('doc_id' = c('A', 'B', 'C', 'D'),
'tag_list' = c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4"),
'single_tag' = c("tagA2", NA, "tagC", NA)
)
Here is what I've been doing: If the value of single_tag is NA, I try to replace it with the value in tag_list.
df %>% mutate(single_tag = ifelse(is.na(single_tag), tag_list, single_tag))
This works as expected with the following output
doc_id tag_list single_tag
1 A tagA1,tagA2,tagA3 tagA2
2 B tagB1,tabB2 tagB1,tabB2
3 C tagC3, tagC4 tagC
4 D tagD1,tagD3,tagD4 tagD1,tagD3,tagD4
Now I want to do the same thing again, but this time I would like to replace single_tag with only the first value in tag_list if single_tag is NA (expected output below). Here's the code I tried.
df %>% mutate(single_tag = ifelse(is.na(single_tag), str_split(tag_list, ",")[[1]][1], single_tag))
Expected output (** added for emphasis) :
doc_id tag_list single_tag
1 A tagA1,tagA2,tagA3 tagA2
2 B tagB1,tabB2 **tagB1**
3 C tagC3, tagC4 tagC
4 D tagD1,tagD3,tagD4 **tagD1**
Actual output (** added for emphasis):
doc_id tag_list single_tag
1 A tagA1,tagA2,tagA3 tagA2
2 B tagB1,tabB2 **tagA1**
3 C tagC3, tagC4 tagC
4 D tagD1,tagD3,tagD4 **tagA1**
I also tried this with modify_if
df <- df %>% mutate(single_tag = modify_if(.,is.na(single_tag), ~ str_split(tag_list, ",")[[1]][1], .else=single_tag))
I get the following error:
Error in `mutate()`:
ℹ In argument: `single_tag = modify_if(...)`.
Caused by error in `where_if()`:
! length(.p) == length(.x) is not TRUE
I did some digging and found that the length of .x is 3 and of the predicate .p is 4. I have discovered that .p produces a vector of four logical values one for each row in df. .x I presume is only getting the values of the three columns in one row.
While I know some ways to achieve what I need, I want to understand what is going on in these two cases. I feel like I'm using a traditional way of thinking about how functions and arguments work, but somehow it's different in this case (because of vectorisation perhaps?). I tried reading the documentation and the code, but I am stumped.
I'm on R version 4.2.3 if that matters.
Any help would be appreciated!
Going through your examples in order:
library(tidyverse)
df %>% mutate(single_tag = ifelse(is.na(single_tag), str_split(tag_list, ",")[[1]][1], single_tag))
With this, it's instructive to look at the output of str_split(tag_list, ","):
str_split(df$tag_list, ",")
[[1]]
[1] "tagA1" "tagA2" "tagA3"
[[2]]
[1] "tagB1" "tabB2"
[[3]]
[1] "tagC3" " tagC4"
[[4]]
[1] "tagD1" "tagD3" "tagD4"
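As you can see, str_split() returns a list with one element per row. [[1]][1] takes the first piece of the first element only, i.e. "tagA1" from the first row, and ifelse() then recycles that single value into every row where single_tag is NA, hence your result.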
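To make the recycling explicit, here is a minimal sketch on the question's data. The variable names (pieces, first_of_first, first_of_each) are only for illustration and do not come from the question. It contrasts the single value produced by [[1]][1] with one first piece per row:

```r
library(stringr)

tag_list   <- c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4")
single_tag <- c("tagA2", NA, "tagC", NA)

# A list with one character vector per row
pieces <- str_split(tag_list, ",")

# [[1]][1] picks the first piece of the FIRST row only: a length-1 vector
first_of_first <- pieces[[1]][1]

# The first piece of EACH row: a length-4 vector
first_of_each <- vapply(pieces, function(p) p[1], character(1))

# ifelse() recycles the length-1 value into every NA position
recycled <- ifelse(is.na(single_tag), first_of_first, single_tag)

# With a value per row, each NA gets the tag from its own row
per_row <- ifelse(is.na(single_tag), first_of_each, single_tag)

print(recycled)
print(per_row)
```

The first result reproduces the "Actual output" from the question, the second the "Expected output".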
df <- df %>% mutate(single_tag = modify_if(.,is.na(single_tag), tag_list, .else=single_tag))
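The issue with this is that .x (the first input of modify_if()) is, per the documentation, meant to be a vector, but you're passing the whole dataframe (the . placeholder) as the first input. A dataframe is a list of its columns, so .x has length 3 (one per column), while the predicate is.na(single_tag) has length 4 (one per row), which is exactly the mismatch the error reports.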
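The length mismatch can be checked directly. The following is a small sketch, assuming purrr is installed; the "missing" placeholder and the names n_cols, n_rows and filled are made up for illustration. It also shows that modify_if() is happy once .x is the column itself rather than the dataframe:

```r
library(purrr)

df <- data.frame(
  doc_id     = c("A", "B", "C", "D"),
  tag_list   = c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4"),
  single_tag = c("tagA2", NA, "tagC", NA)
)

# A dataframe is a list of columns, so its length is the number of columns
n_cols <- length(df)

# The predicate has one value per row
n_rows <- length(is.na(df$single_tag))

print(c(n_cols = n_cols, n_rows = n_rows))

# Passing the column keeps .x and .p the same length.
# "missing" is a purely illustrative replacement value.
filled <- modify_if(df$single_tag, is.na(df$single_tag), ~ "missing")
print(filled)
```

Note that .f is still called on each selected element of .x on its own, so modify_if() has no access to the matching tag_list value; that is why the two-input map2 family is a better fit here.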
Update: a newer solution, suggested by Ritchie Sacramento, is to use str_split_i():
df |> mutate(single_tag = ifelse(is.na(single_tag), str_split_i(tag_list, ",", 1), single_tag))
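As a quick check of why this works, str_split_i() is vectorised over its input and returns the i-th piece of every string, so there is one value per row for ifelse() to choose from. A minimal sketch, assuming stringr 1.5.0 or later (the version that introduced str_split_i()); the name first_tags is illustrative:

```r
library(stringr)

tag_list <- c("tagA1,tagA2,tagA3", "tagB1,tabB2", "tagC3, tagC4", "tagD1,tagD3,tagD4")

# One first piece per input string, returned as a plain character vector
first_tags <- str_split_i(tag_list, ",", 1)

print(first_tags)
```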
Original:
Use str_extract() to get everything before the first comma (^ is the start, . is any character, * means match it any number of times, ? makes sure it is not greedy (i.e. it doesn't just match the whole string if it doesn't have to), (?=,) is a lookahead for a comma):
df |> mutate(single_tag = ifelse(is.na(single_tag), str_extract(tag_list, "^.*?(?=,)"), single_tag))
Turn the tag_list column into an actual list column, then take the first element of that (using map_chr()):
df |> mutate(tag_list = str_split(tag_list, ","),
single_tag = ifelse(is.na(single_tag), map_chr(tag_list, 1), single_tag))
Iterate over both columns row by row with map2_chr(), taking the first piece explicitly:
df |> mutate(single_tag = map2_chr(tag_list, single_tag, \(t, s) ifelse(is.na(s), str_split(t, ",")[[1]][1], s)))