Search code examples
rsumcharacterpercentage

Count the total and percentage of values in character class column


In a large dataset I have a column (treatment) which contains numbers, sometimes more then one, but is classed as character.

treatment <- c("1", "1", "2", "5", "1,2", "2,5", "1,2,5", "3") 
df <- data.frame(treatment)

Each number represents a treatment. I want to count the total number and percentage of each treatment.

Desired output:

treatment   number     percent
   1           4          50
   2           4          50
   3           1          12,5
   5           3          37,5

I've tried to sum and to use group_by(treatment) %>% summarise(percent = 100 *n() / nrow(df) But i'm having trouble with the rows with multiple numbers and it's class character. Any tips?


Solution

  • You can use tidy::separate_longer_delim to separate out the combined values by "," - i.e.:

    library(dplyr)
    library(tidyr)
    
    separate_longer_delim(df, treatment, ",")
    #   treatment
    #1          1
    #2          1
    #3          2
    #4          5
    #5          1
    #6          2
    #7          2
    #8          5
    #9          1
    #10         2
    #11         5
    #12         3
    

    Then simply use dplyr::count() to count and dplyr::mutate to get the percents. Note, since you mentioned each row is a contact, you can create a simple element to identify the number of contacts.

    All together (and thanks to @DarrenTsai for the improved approach):

    contacts <- length(df$treatment)
    
    separate_longer_delim(df, treatment, ",") %>%
      count(treatment) %>%
      mutate(percent = n / contacts)
    

    Output

      treatment n percent
    1         1 4   0.500
    2         2 4   0.500
    3         3 1   0.125
    4         5 3   0.375
    

    Note, to get the same results with older versions of tidyr (<v1.3):

    xx <- max(nchar(gsub(",", "", df$treatment)))
    contacts <- length(df$treatment)
    
    df %>%
      separate(treatment, into = paste0("col", seq_len(xx)), sep = ",") %>%
      pivot_longer(everything(), values_drop_na = TRUE, names_to = NULL, values_to = "treatment") %>%
      count(treatment) %>%
      mutate(percent = n / contacts)
    
    #  treatment     n percent
    #  <chr>     <int>   <dbl>
    #1 1             4   0.5  
    #2 2             4   0.5  
    #3 3             1   0.125
    #4 5             3   0.375