Search code examples
rdplyrsummarize

Percentage of total counts by group in R


I'm trying to create an output that calculates the percentage of counts, out of total counts (in a data frame), by factor level, but can't seem to figure out how to retain the grouping structure in the output.

I can get the total counts that I want to divide by...

df %>% summarise(sum(num))
# 15

...and the total by group...

df %>% group_by(species) %>% summarise(sum(num))
# A tibble: 3 × 2
#   species                  `sum(num)`
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum          4
# 2 Farfantepenaeus notialis          0
# 3 Farfantepenaeus spp              11

But I can't get it to get it to look like this...

# ???
#   species                     Percent
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum       4 / 15 = 0.267
# 2 Farfantepenaeus notialis       0 / 15 = 0.000
# 3 Farfantepenaeus spp           11 / 15 = 0.733

The closest I got was this, but because I used reframe() it returns the ungrouped data

df %>% group_by(species) %>% 
  summarise(factor_count=sum(num)) %>% 
  # ungroup() %>% 
  # Wanring: # Please use `reframe()` instead., When switching from `summarise()` 
  # to `reframe()`, remember that `reframe()` always returns an ungrouped data
  reframe(percent=factor_count/sum(df$num))

# A tibble: 3 × 1
  percent
    <dbl>
1   0.267
2   0    
3   0.733

Data:

> dput(df)
structure(list(species = c("Farfantepenaeus notialis", "Farfantepenaeus spp", 
"Farfantepenaeus notialis", "Farfantepenaeus notialis", "Farfantepenaeus duorarum", 
"Farfantepenaeus duorarum", "Farfantepenaeus notialis", "Farfantepenaeus spp", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis", 
"Farfantepenaeus notialis", "Farfantepenaeus spp", "Farfantepenaeus duorarum", 
"Farfantepenaeus spp", "Farfantepenaeus spp", "Farfantepenaeus duorarum", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus spp", 
"Farfantepenaeus spp", "Farfantepenaeus notialis"), num = c(0L, 
0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 3L, 0L, 2L, 4L, 0L)), row.names = c(159897L, 174698L, 
236857L, 190237L, 327321L, 272931L, 304567L, 75538L, 109206L, 
351373L, 280332L, 163966L, 282183L, 341197L, 316962L, 354703L, 
343971L, 95333L, 244258L, 254061L, 87561L, 186908L, 221318L, 
258688L, 97737L), class = "data.frame")

Solution

  • Two steps: summarize group-totals, then percent-calcs on everything combined.

    library(dplyr)
    df %>%
      summarize(Percent = sum(num), .by = species) %>%
      mutate(Percent = Percent / sum(Percent))
    #                    species   Percent
    # 1 Farfantepenaeus notialis 0.0000000
    # 2      Farfantepenaeus spp 0.7333333
    # 3 Farfantepenaeus duorarum 0.2666667
    

    For your code:

    • reframe is not necessary (mostly when the number of rows changes, it can often be used in place of summarise, but I haven't verified if/where the two differ significantly), and in fact here it will drop the species column
    • (Almost) Never use df$ in a pipe that starts with df: using df$num ignores anything you've done since the start of the pipe, meaning that grouping, filtering, additions/changes, etc, are not available in that version of df. There are certainly times when it is useful and even necessary, but they are few and far-between.