
Extracting the sum of numbers from a vector of strings in R after using str_extract_all()


I have a poorly formatted data frame with a vector of strings, e.g.

f <- data.frame(FruitQuantity = c("10 apple", "orange(15), bananas(30)", "cucumber-15", 0, "not sure", NA))


> f
            FruitQuantity
1                10 apple
2 orange(15), bananas(30)
3             cucumber-15
4                       0
5                not sure
6                    <NA>

from which I wish to extract the sum of count data into another vector like so:

             FruitQuantity Total
1                10 apple    10
2 orange(15), bananas(30)    45
3             cucumber-15    15
4                       0     0
5                not sure    NA
6                    <NA>    NA

To extract the numeric data, I did the following:

library(tidyverse)

f$SeperateCount<-str_extract_all(f$FruitQuantity,"\\d+")

This results in:

> f
            FruitQuantity SeperateCount
1                10 apple            10
2 orange(15), bananas(30)        15, 30
3             cucumber-15            15
4                       0             0
5                not sure              
6                    <NA>            NA

> f$SeperateCount
[[1]]
[1] "10"

[[2]]
[1] "15" "30"

[[3]]
[1] "15"

[[4]]
[1] "0"

[[5]]
character(0)

[[6]]
[1] NA

It returned a list whose elements are character vectors of the extracted numbers, e.g. c("15", "30") in the second row and character(0) in the fifth row.

To obtain the sum of the elements in each list entry, I tried the following:

f$Total<-sapply(f$SeperateCount,sum)

which returned an error:

Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument

Then I tried converting the characters in the list into integers:

f$SeperateCountNumeric<-lapply(f$SeperateCount, function(x) if(all(grepl('^[0-9.]+$', x))) as.integer(x) else x)

> f$SeperateCountNumeric
[[1]]
[1] 10

[[2]]
[1] 15 30

[[3]]
[1] 15

[[4]]
[1] 0

[[5]]
integer(0)

[[6]]
[1] NA

> f
            FruitQuantity SeperateCount SeperateCountNumeric
1                10 apple            10                   10
2 orange(15), bananas(30)        15, 30               15, 30
3             cucumber-15            15                   15
4                       0             0                    0
5                not sure                                   
6                    <NA>            NA                   NA

But even after the conversion to integer, the same character error persists:

> sapply(f$SeperateCountNumeric,sum)

Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument

Are there any alternative ways of doing this?

Thank you very much for the help!


Solution

  • With the help of the stringr package, you can try this:

    library(stringr)
    
    # Sum the digit runs in each string; strings with no digits yield NA
    f$Total <- sapply(str_extract_all(f$FruitQuantity, "[[:digit:]]+"),
      function(x) if (length(x) == 0) NA else sum(as.numeric(x)))
    
    f
                FruitQuantity Total
    1                10 apple    10
    2 orange(15), bananas(30)    45
    3             cucumber-15    15
    4                       0     0
    5                not sure    NA
    6                    <NA>    NA
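
    A note on why the conversion attempted in the question still errors, as far as the output shown there suggests: grepl() returns FALSE for a missing value, so the NA element fails the all(grepl(...)) check, is handed back unchanged as a character NA, and sum() then rejects it. The sketch below reproduces this on a hand-built list standing in for f$SeperateCount, then converts every element unconditionally with as.numeric(). The names counts, converted, classes and totals are illustrative and do not come from the question.

    ```r
    # Hand-built stand-in for the list returned by str_extract_all() in the question
    counts <- list("10", c("15", "30"), "15", "0", character(0), NA_character_)

    # The conversion from the question: the NA element fails the grepl() test
    # and is returned unchanged, i.e. still as a character vector
    converted <- lapply(counts, function(x)
      if (all(grepl("^[0-9.]+$", x))) as.integer(x) else x)
    classes <- sapply(converted, class)
    print(classes)  # the last element is still "character"

    # Converting every element unconditionally avoids the leftover character NA;
    # empty matches are mapped to NA explicitly, since sum(numeric(0)) is 0
    totals <- vapply(counts, function(x)
      if (length(x) == 0) NA_real_ else sum(as.numeric(x)), numeric(1))
    print(totals)
    ```

    Using vapply() with numeric(1) instead of sapply() is a design choice here: it fails loudly if any element does not reduce to a single number.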
    

    A similar base R solution:

    # Strip letters and punctuation, split on spaces, then sum what remains
    f$Total <- sapply(strsplit(trimws(
      gsub("[[:alpha:](),-]", "", f$FruitQuantity)), " "),
        function(x) if (length(x) == 0) NA else sum(as.numeric(x)))
    
    f
                FruitQuantity Total
    1                10 apple    10
    2 orange(15), bananas(30)    45
    3             cucumber-15    15
    4                       0     0
    5                not sure    NA
    6                    <NA>    NA
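
    Another base R option, offered as a sketch rather than a drop-in replacement: regmatches() together with gregexpr() extracts the digit runs directly, which avoids having to list every character to strip. The helper name sum_digits is hypothetical, and missing values are set aside before matching so that the result does not depend on how the regex functions treat NA.

    ```r
    # Hypothetical helper: sum all runs of digits found in each string
    sum_digits <- function(x) {
      x <- as.character(x)               # also covers factor columns
      out <- rep(NA_real_, length(x))    # NA input stays NA
      ok <- !is.na(x)
      hits <- regmatches(x[ok], gregexpr("[0-9]+", x[ok]))
      # A string without digits gives character(0), which is mapped to NA
      out[ok] <- vapply(hits, function(h)
        if (length(h) == 0) NA_real_ else sum(as.numeric(h)), numeric(1))
      out
    }

    f <- data.frame(FruitQuantity = c("10 apple", "orange(15), bananas(30)",
                                      "cucumber-15", 0, "not sure", NA))
    f$Total <- sum_digits(f$FruitQuantity)
    print(f)
    ```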