Search code examples
rgroup-byformatlabelrounding

Rounded percentages that add up to 100% in group_by statement


I'm having a hard time making rounded percentages that add up to 100% within groups.

Consider the following example:

# Loading main library used
library(dplyr)

# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
                categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
                values = c(2200,4700,3000,2000,2900,4400,2200,1000))

print(df)
#   group categories values
# 1     A       Cat1   2200
# 2     A       Cat2   4700
# 3     A       Cat3   3000
# 4     A       Cat4   2000
# 5     B       Cat1   2900
# 6     B       Cat2   4400
# 7     B       Cat3   2200
# 8     B       Cat4   1000

df_with_shares = df %>%
  # Calculating group totals and adding them back to the main df
  left_join(df %>% 
              group_by(group) %>% 
              summarize(group_total = sum(values)),
            by='group') %>%
  # Calculating each category's share within the groups
  mutate(group_share = values / group_total,
         group_share_rounded = round(group_share,2))


# Summing the rounded shares within groups
rounded_totals = df_with_shares %>% 
  group_by(group) %>% 
  summarize(total_share = sum(group_share_rounded))


print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr>       <int>
#   1 A        0.99
#   2 B        1.01
# Note how the totals do not add up to 100% as expected

I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:

df_with_rounded_shares = df %>% 
  mutate(
    percs = values / sum(values),
    percs_cumsum = cumsum(percs),
    percs_cumsum_round = round(percs_cumsum, 2),
    percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round,1),0),
    percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)

However, the method I devised in the thread above does not work as I would like. It just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which need their rounded values to add up to 100% independently from every other group.

What can I do to generate a column of rounded percentages that add up to 100% by group?

PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.


Solution

  • The method devised in your implementation (from here) just needs a few small tweaks to make it work.

    First, include a group_by statement before calculating the new columns. Also, you need to use a reframe statement instead of the mutate statement you have now.

    In essence, this is what it'll look like:

    # Modified version of your implementation of the rounding procedure. 
    # The new procedure below accommodates for grouping variables.
    df_with_rounded_shares_by_group = df %>% 
      group_by(group) %>% 
      reframe(
        group_share = values / sum(values),
        group_share_cumsum = cumsum(group_share),
        group_share_cumsum_round = round(group_share_cumsum, 2),
        group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
        group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
      # Removing unnecessary temporary columns
      select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
    

    Doing the above will generate a tibble that looks like this:

      group group_share group_share_rounded_final
      <chr>       <dbl>                     <dbl>
    1 A          0.185                       0.18
    2 A          0.395                       0.4 
    3 A          0.252                       0.25
    4 A          0.168                       0.17
    5 B          0.276                       0.28
    6 B          0.419                       0.42
    7 B          0.210                       0.2 
    8 B          0.0952                      0.1 
    

    Then you can verify that the new rounded values really do add up to 100%:

    # Verifying if the results add up to 100% within each group
    rounded_totals = df_with_rounded_shares_by_group %>% 
      group_by(group) %>% 
      summarize(total_share = sum(group_share_rounded_final))
    
    print(rounded_totals)
    # # A tibble: 2 x 2
    # group total_share
    # <chr>       <dbl>
    #   1 A           1
    #   2 B           1
    
    # Yep, they all add up to 100% as expected!
    

    If you look closely, though, there is a small issues: the resulting table has lost the categories column.

    Here is the full code that fixes the issue. I admit that it's a bit of a cumbersome workaround and welcome any suggestions to improve this:

    df_with_rounded_shares_by_group = df %>% 
      mutate(join_col = row_number()) %>%
      left_join(
        df %>% 
          group_by(group) %>% 
          reframe(
            group_share = values / sum(values),
            group_share_cumsum = cumsum(group_share),
            group_share_cumsum_round = round(group_share_cumsum, 3),
            group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
            group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
          # Removing unnecessary temporary columns
          ungroup() %>%
          select(-group,-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset) %>%
          mutate(join_col = row_number()), 
        by='join_col') %>%
      select(-join_col)
    
    
    print(df_with_rounded_shares_by_group)
    
    >   group categories values group_share group_share_rounded_final
    > 1     A       Cat1   2200   0.1848739                     0.185
    > 2     A       Cat2   4700   0.3949580                     0.395
    > 3     A       Cat3   3000   0.2521008                     0.252
    > 4     A       Cat4   2000   0.1680672                     0.168
    > 5     B       Cat1   2900   0.2761905                     0.276
    > 6     B       Cat2   4400   0.4190476                     0.419
    > 7     B       Cat3   2200   0.2095238                     0.210
    > 8     B       Cat4   1000   0.0952381                     0.095
    

    Now all of the necessary columns are in your output and the group_share_rounded_final column really does add up to 100%.

    Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.