Search code examples
rdplyrgroup-bypercentagesummarize

Find relative frequencies of summarized columns in R


I need to get the relative frequencies of a summarized column in R. I've used dplyr's summarize to find the total of each grouped row, like this:

data %>%
  group_by(x) %>%
  summarise(total = sum(dollars))

     x                    total 
   <chr>                 <dbl>
 1 expense 1              3600 
 2 expense 2              2150 
 3 expense 3              2000 

But now I need to create a new column for the relative frequencies of each total row to get this result:

     x                   total     p
   <chr>                 <dbl>   <dbl>
 1 expense 1              3600   46.45%
 2 expense 2              2150   27.74%
 3 expense 3              2000   25.81%

I've tried this:

data %>%
  group_by(x) %>%
  summarise(total = sum(dollars), p = scales::percent(total/sum(total))

and this:

data %>%
  group_by(x) %>%
  summarise(total = sum(dollars), p = total/sum(total)*100)

but the result is always this:

     x                   total     p
   <chr>                 <dbl>   <dbl>
 1 expense 1              3600    100%
 2 expense 2              2150    100%
 3 expense 3              2000    100%

The problem seems to be the summarized total column that may be affecting the results. Any ideas to help me? Thanks


Solution

  • You get 100% because of the grouping. However, after you've summarized, dplyr will drop the one level of grouping. Meaning that if you e.g. do mutate() after, you get the results you need:

    library(dplyr)
    
    data <- tibble(
      x = c("expense 1", "expense 2", "expense 3"),
      dollars = c(3600L, 2150L, 2000L)
    )
    
    
    data %>%
      group_by(x) %>%
      summarise(total = sum(dollars)) %>% 
      mutate(p = total/sum(total)*100)
    
    
    # A tibble: 3 x 3
      x         total     p
      <chr>     <int> <dbl>
    1 expense 1  3600  46.5
    2 expense 2  2150  27.7
    3 expense 3  2000  25.8