
Count occurrences of factors, comma separated, AND conditional? in R


I'm trying to do some complex calculations, and part of the code needs to parse a comma-separated entry and count the number of values that are greater than 0.

Example input data:

a <- c(0,0,3,0)
b <- c(4,4,0,1)
c <- c("3,4,3", "2,1", 0, "5,8")
x <- data.frame(a, b, c)

x

  a b     c
1 0 4 3,4,3
2 0 4   2,1
3 3 0     0
4 0 1   5,8

The column that I need to parse, c, is a factor; all other columns are numeric. The number of comma-separated values varies from row to row; in this example the count of values above 0 ranges from 0 to 3. The desired output would look like this:

x$c_occur <- c(3, 2, 0, 2)
x

  a b     c c_occur
1 0 4 3,4,3       3
2 0 4   2,1       2
3 3 0     0       0
4 0 1   5,8       2

Where c_occur lists the number of occurrences > 0 in the c column.

I was thinking something like this would work... but I can't figure it out.

library(dplyr)
x_desired <- x %>%
  mutate(c_occur = count(strsplit(c, ","), > 0))

Solution

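  • We can use str_count from stringr. The pattern '[1-9]\\d*' matches a run of digits that starts with a non-zero digit, so a lone 0 is not counted.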

    library(stringr)
    library(dplyr)
    x %>%
        mutate(c_occur = str_count(c, '[1-9]\\d*'))
    #  a b     c c_occur
    #1 0 4 3,4,3       3
    #2 0 4   2,1       2
    #3 3 0     0       0
    #4 0 1   5,8       2
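
To make the regex idea above concrete, here is a minimal base R sketch of the same counting logic. It swaps str_count for gregexpr/regmatches, and count_nonzero_tokens is a hypothetical helper name, not something from the original answer. It assumes the entries hold non-negative integers only.

```r
# Base R sketch of the same idea as str_count(c, '[1-9]\\d*'):
# count the tokens that begin with a non-zero digit.
# count_nonzero_tokens() is a hypothetical helper, not part of any package.
count_nonzero_tokens <- function(x) {
  x <- as.character(x)                 # convert factors to strings first
  hits <- gregexpr("[1-9][0-9]*", x)   # match positions for each element
  lengths(regmatches(x, hits))         # number of matches per element
}

count_nonzero_tokens(c("3,4,3", "2,1", "0", "5,8"))
# 3 2 0 2

# Multi-digit values count once; a lone 0 is skipped.
count_nonzero_tokens(c("10,0,205", "0,0"))
# 2 0
```

Because the pattern only looks at digits, it would miscount decimals such as "1.5" and would count negative values such as "-3"; for that kind of data, splitting and comparing numerically is the safer route.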
    

    Alternatively, split 'c' with strsplit (converting the factor to character first), loop over the resulting list with map_int, and sum the logical vector as.integer(.x) > 0 for each element.

    library(purrr)
    x %>%
        mutate(c_occur = map_int(strsplit(as.character(c), ","),
                                 ~ sum(as.integer(.x) > 0)))
    #  a b     c c_occur
    #1 0 4 3,4,3       3
    #2 0 4   2,1       2
    #3 3 0     0       0
    #4 0 1   5,8       2
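
The split-and-count approach above can also be sketched in base R, without purrr or dplyr. This is only an illustration under stated assumptions: count_positive is a hypothetical helper name, vapply stands in for map_int, and as.numeric is used instead of as.integer so that decimals would also compare correctly.

```r
# Base R sketch of the split-and-count approach (no purrr needed).
# count_positive() is a hypothetical helper name.
count_positive <- function(x) {
  # strsplit() needs character input, so convert factors first.
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(
    parts,
    # Each TRUE counts as 1; unparseable tokens become NA and are ignored.
    function(p) sum(as.numeric(p) > 0, na.rm = TRUE),
    integer(1)
  )
}

x <- data.frame(
  a = c(0, 0, 3, 0),
  b = c(4, 4, 0, 1),
  c = c("3,4,3", "2,1", "0", "5,8")
)
x$c_occur <- count_positive(x$c)
x$c_occur
# 3 2 0 2
```

Comparing numerically rather than by pattern means negative values are not counted, which matches the "> 0" condition in the question.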
    

    Or we can add a row identifier, expand each entry into one row per value with separate_rows, then group_by the identifier and summarise the number of values above 0.

    library(tidyr)
    x %>%
        mutate(rn = row_number()) %>%
        separate_rows(c, convert = TRUE) %>%
        group_by(rn) %>%
        summarise(c_occur = sum(c > 0)) %>%
        select(-rn) %>%
        bind_cols(x, .)
    # A tibble: 4 x 4
    #      a     b c     c_occur
    #  <dbl> <dbl> <fct>   <int>
    #1     0     4 3,4,3       3
    #2     0     4 2,1         2
    #3     3     0 0           0
    #4     0     1 5,8         2