I have a poorly formatted data frame with a vector of strings, e.g.
f<-data.frame(FruitQuantity=c("10 apple", "orange(15), bananas(30)", "cucumber-15",0,"not sure",NA))
> f
FruitQuantity
1 10 apple
2 orange(15), bananas(30)
3 cucumber-15
4 0
5 not sure
6 <NA>
from which I wish to extract the sum of count data into another vector like so:
FruitQuantity Total
1 10 apple 10
2 orange(15), bananas(30) 45
3 cucumber-15 15
4 0 0
5 not sure NA
6 <NA> NA
To extract the numeric data, I did the following:
library(tidyverse)
f$SeperateCount<-str_extract_all(f$FruitQuantity,"\\d+")
This results in:
>f
FruitQuantity SeperateCount
1 10 apple 10
2 orange(15), bananas(30) 15, 30
3 cucumber-15 15
4 0 0
5 not sure
6 <NA> NA
> f$SeperateCount
[[1]]
[1] "10"
[[2]]
[1] "15" "30"
[[3]]
[1] "15"
[[4]]
[1] "0"
[[5]]
character(0)
[[6]]
[1] NA
It returned a list whose elements are character vectors of the extracted numbers, e.g. c("15", "30") in the second row and character(0) in the fifth row.
To obtain the sum of the elements in each list entry, I tried the following:
f$Total<-sapply(f$SeperateCount,sum)
This returned an error:
Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
Then I tried converting the characters in the list into integers:
f$SeperateCountNumeric<-lapply(f$SeperateCount, function(x) if(all(grepl('^[0-9.]+$', x))) as.integer(x) else x)
> f$SeperateCountNumeric
[[1]]
[1] 10
[[2]]
[1] 15 30
[[3]]
[1] 15
[[4]]
[1] 0
[[5]]
integer(0)
[[6]]
[1] NA
> f
FruitQuantity SeperateCount SeperateCountNumeric
1 10 apple 10 10
2 orange(15), bananas(30) 15, 30 15, 30
3 cucumber-15 15 15
4 0 0 0
5 not sure
6 <NA> NA NA
But even after conversion to integer, the same character error persists:
> sapply(f$SeperateCountNumeric,sum)
Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
Are there any alternative ways of doing this?
Thank you very much for the help!
With the help of the stringr package, you can try this:
library(stringr)
f$Total <- sapply(
  str_extract_all(f$FruitQuantity, "[[:digit:]]+"),
  function(x) ifelse(identical(x, character(0)), NA, sum(as.numeric(x)))
)
f
FruitQuantity Total
1 10 apple 10
2 orange(15), bananas(30) 45
3 cucumber-15 15
4 0 0
5 not sure NA
6 <NA> NA
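As for why the error in the question persisted after the integer conversion: the sixth element is NA stored as character, and grepl() returns FALSE for NA, so the `else x` branch left it as character and sum() still failed on it. Below is a minimal sketch of that, assuming the list of matches looks exactly as printed in the question (it is rebuilt by hand here so the snippet runs without stringr; `counts`, `converted`, `still_character` and `totals` are illustrative names).

```r
# Matches as printed in the question (str_extract_all output), rebuilt by hand
counts <- list("10", c("15", "30"), "15", "0", character(0), NA_character_)

# The question's conversion: grepl() is FALSE for NA, so that element is returned unchanged
converted <- lapply(counts, function(x) {
  if (all(grepl("^[0-9.]+$", x))) as.integer(x) else x
})

# Find the elements that are still character after the conversion
still_character <- vapply(converted, is.character, logical(1))
which(still_character)  # 6: the NA element, which is what sum() rejects

# Converting unconditionally, and treating "no match" as NA, avoids the error
totals <- vapply(counts, function(x) {
  if (length(x) == 0) NA_real_ else sum(as.numeric(x))
}, numeric(1))
totals
```

Note that sum() of an empty vector is 0, which is why the empty case has to be mapped to NA explicitly if "not sure" should give NA rather than 0.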
A similar solution in base R:
f$Total <- sapply(
  strsplit(trimws(gsub("[[:alpha:](),-]", "", f$FruitQuantity)), " "),
  function(x) ifelse(identical(x, character(0)), NA, sum(as.numeric(x)))
)
f
FruitQuantity Total
1 10 apple 10
2 orange(15), bananas(30) 45
3 cucumber-15 15
4 0 0
5 not sure NA
6 <NA> NA
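If you would rather match the digits directly than strip everything else out, something along these lines might work as well. This is only a sketch: `sum_digits` is a hypothetical helper name, and it uses regmatches() with gregexpr() plus vapply() so that the result is always a plain numeric vector.

```r
f <- data.frame(FruitQuantity = c("10 apple", "orange(15), bananas(30)",
                                  "cucumber-15", 0, "not sure", NA))

# Hypothetical helper: sum every run of digits in one string
sum_digits <- function(s) {
  if (is.na(s)) return(NA_real_)                       # missing input stays NA
  m <- regmatches(s, gregexpr("[0-9]+", s))[[1]]       # all digit runs in s
  if (length(m) == 0) NA_real_ else sum(as.numeric(m)) # no digits at all gives NA
}

# as.character() guards against FruitQuantity being a factor (the default in R < 4.0)
f$Total <- vapply(as.character(f$FruitQuantity), sum_digits,
                  numeric(1), USE.NAMES = FALSE)
f
```

Unlike the gsub() approach, this does not depend on listing every separator character that may appear between the numbers.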