Search code examples
rcounttidyversepercentage

calculating counts and percentages of a variable


I am trying to create a count of the number of observations and extract percentages from those counts.

I have data that looks like this:

UserID              substance_use                     
43124                       0 
43124                       1
43124                       0 
43124                       0
43124                       1
215                         1
215                         1

I want a count of the total observations of substance use like this:

UserID             substance_use                                 count  
43124                       0                                        5
43124                       1                                        5
43124                       0                                        5
43124                       0                                        5
43124                       1                                        5
215                         1                                        2
215                         1                                        2

And then, I want the % of visits that participants said yes and no, like this:

UserID      substance_use      count      percent_yes       percent_no
43124        0                  5            40%               60%
43124        1                  5            40%               60%
43124        0                  5            40%               60%
43124        0                  5            40%               60%         
43124        1                  5            40%              0%  
215          1                  2            100%              0
215          1                  2            100%              0%

I tried using the count function for the first part, but it's not taking into account the 0's. Any help would be greatly appreciated.


Solution

  • We may use add_count to create the 'count' column and then get the mean of the binary column by group to return the percent_yes and subtract 100 from the percent_yes to return percent_no

    library(dplyr)
    df1 %>%
      add_count(UserID, name = 'count') %>%
      group_by(UserID) %>%
      mutate(percent_yes = 100 * mean(substance_use), 
           percent_no = 100 - percent_yes) %>% 
      ungroup
    

    -output

    # A tibble: 7 × 5
      UserID substance_use count percent_yes percent_no
       <int>         <int> <int>       <dbl>      <dbl>
    1  43124             0     5          40         60
    2  43124             1     5          40         60
    3  43124             0     5          40         60
    4  43124             0     5          40         60
    5  43124             1     5          40         60
    6    215             1     2         100          0
    7    215             1     2         100          0
    

    NOTE: Here, we assumed no missing values in 'substance_use' column

    data

    df1 <- structure(list(UserID = c(43124L, 43124L, 43124L, 43124L, 43124L, 
    215L, 215L), substance_use = c(0L, 1L, 0L, 0L, 1L, 1L, 1L)), 
    class = "data.frame", row.names = c(NA, 
    -7L))