r dplyr stringr

Command including str_extract_all() not returning expected results within mutate()


I have a vector of strings, each containing an alphanumeric code whose digits take the values 1-3 (e.g. "1RV2GA"). I want to extract the digits and get their sum. So for "1RV2GA", it should extract 1 and 2 and add them to get 3.

I have figured out how to do this on a single string:

str_extract_all(
"1RV2GA",  "\\(?[0-3,.]+\\)?", simplify = T) %>% 
as.numeric() %>% sum()

[1] 3

My problem is, I can't figure out how to get this to work across a whole vector. str_extract_all() returns a list, so that would obviously cause issues within mutate, but I just need a sum for each row.

To make some sample data, here:

test<-data.frame(ID=c("2VG1AR", "1OR2AG", "1GV1OA"),
                 value = c(4,8,2))
> test
      ID value
1 2VG1AR     4
2 1OR2AG     8
3 1GV1OA     2

Normally str_extract_all() would handle a vector like this, returning a list of character vectors:

> str_extract_all(test$ID, "\\(?[0-3,.]+\\)?")
[[1]]
[1] "2" "1"

[[2]]
[1] "1" "2"

[[3]]
[1] "1" "1"

But obviously, to get the sums of the output vectors for each input value, I need them to be numeric, or I need a function designed for an input that is an atomic vector. And if I try a mutate command with simplify=T, the sum of all the values in the ID vector is returned:

test %>% mutate(ID.numsum = 
str_extract_all(ID,  "\\(?[0-3,.]+\\)?", simplify = T) %>% 
as.numeric() %>% sum())

      ID value ID.numsum
1 2VG1AR     4         8
2 1OR2AG     8         8
3 1GV1OA     2         8

If I just take the first element of the str_extract_all() list output, the value that is correct for "2VG1AR" is repeated down the entire new column:

test%>%mutate(ID.numsum = str_extract_all(ID,  "\\(?[0-3,.]+\\)?")[[1]] %>% 
as.numeric() %>% sum())
# A tibble: 3 × 3
  ID     value ID.numsum
  <chr>  <dbl>     <dbl>
1 2VG1AR     4         3
2 1OR2AG     8         3
3 1GV1OA     2         3

str_extract() also doesn't work, because it only extracts the first numeral in each string. If I try it on "2VG1AR" it returns 2, whereas I need a vector containing 2 and 1 so I can sum them to 3.

Does anyone have a solution here?


Solution

  • sum() is a collapsing function: it reduces everything it is given to a single value. You have to be careful when using such functions in a row-wise manner. You can explicitly map() over the list, one element per row. For example:

    # map_dbl() rather than map_int(): sum() of a numeric vector returns a double
    test %>% mutate(ID.numsum = 
                      purrr::map_dbl(stringr::str_extract_all(ID,  "\\(?[0-3,.]+\\)?"),
                      ~sum(as.numeric(.))))
    

    or you could use rowwise()

    test %>% 
      rowwise() %>% 
      mutate(ID.numsum = 
         stringr::str_extract_all(ID,  "\\(?[0-3,.]+\\)?") |> unlist() |> as.numeric() |> sum())
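
To see why the original attempt gave 8 in every row, here is a minimal base R sketch of the same idea that does not need dplyr or stringr. It is an illustration rather than part of the answer above: the simplified pattern "[0-3]" assumes each number is a single digit, and the names ids, digits, collapsed and per_row are made up for the example.

```r
# Sample IDs from the question
ids <- c("2VG1AR", "1OR2AG", "1GV1OA")

# Assumption: every number is a single digit 0-3, so "[0-3]" is enough here.
# regmatches() + gregexpr() return a list with one character vector per ID.
digits <- regmatches(ids, gregexpr("[0-3]", ids))

# Collapsing the whole list at once mixes all rows together.
collapsed <- sum(as.numeric(unlist(digits)))

# Applying sum() to each list element keeps one result per ID.
per_row <- vapply(digits, function(d) sum(as.numeric(d)), numeric(1))

print(collapsed)
print(per_row)
```

collapsed is the single total over all three IDs, which is what mutate() recycled down the column, while per_row has one sum per ID and can be assigned directly, for example with test$ID.numsum <- per_row.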