Summarise but keep length variable (dplyr)


Basic dplyr question... Respondents could select multiple companies that they use. For example:

library(dplyr)
library(tidyr)  # provides gather(), used below
test <- tibble(
 CompanyA = rep(c(0:1),5),
 CompanyB = rep(c(1),10),
 CompanyC = c(1,1,1,1,0,0,1,1,1,1)
)
test

If it were a forced-choice question - i.e., respondents could make only one selection - I would do the following for a basic summary table:

test %>% 
  summarise_all(sum, na.rm = TRUE) %>%  # funs() is deprecated; pass the function directly
  gather(Response, n) %>% 
  arrange(desc(n)) %>% 
  mutate("%" = round(100*n/sum(n)))

Note, however, that the "%" column is not what I want. I'm instead looking for the proportion of total respondents for each individual response option (since they could make multiple selections).

I've tried adding mutate(totalrows = nrow(.)) %>% before the summarise_all call, so that I could use that variable as the denominator in a later mutate call. However, summarise_all does not preserve the "totalrows" variable.

Also, if there's a better way to do this, I'm open to ideas.


Solution

  • To get the proportion of respondents who chose each option when the variables are coded 0/1, take the column means: the mean of a 0/1 variable is the share of 1s. With your test data, sapply does this:

    sapply(test, mean)
    CompanyA CompanyB CompanyC 
         0.5      1.0      0.8 
    

    If your data is not coded 0/1 (say the responses are stored as 1 and 2 instead), reshape to long format and count the matching value explicitly:

    test %>% 
        gather(key='Company') %>% 
        group_by(Company) %>% 
        summarise(proportion = sum(value == 1) / n())
    
    # A tibble: 3 x 2
      Company  proportion
      <chr>         <dbl>
    1 CompanyA        0.5
    2 CompanyB        1  
    3 CompanyC        0.8
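
For reference, here is a minimal sketch of the summary table the question describes, with both the count and the percentage of all respondents for each option. It assumes a current tidyr, where pivot_longer() is the suggested replacement for gather(). The column names Response, selected, respondents and pct are illustrative choices, not anything required by dplyr.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question: one row per respondent, 1 = selected
test <- tibble(
  CompanyA = rep(0:1, 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)

result <- test %>%
  # Reshape to one row per respondent/company pair
  pivot_longer(everything(), names_to = "Response", values_to = "selected") %>%
  group_by(Response) %>%
  summarise(
    respondents = n(),                      # rows per company = total respondents (NAs included)
    n = sum(selected == 1, na.rm = TRUE),   # respondents who selected this option
    pct = round(100 * n / respondents)      # share of all respondents, not of all selections
  ) %>%
  arrange(desc(n))

result
```

Because the denominator is computed per group with n() inside summarise(), there is no need to carry a separate "totalrows" column through the pipeline. For the example data this gives 100, 80 and 50 percent for CompanyB, CompanyC and CompanyA respectively.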