I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
# group categories values
# 1 A Cat1 2200
# 2 A Cat2 4700
# 3 A Cat3 3000
# 4 A Cat4 2000
# 5 B Cat1 2900
# 6 B Cat2 4400
# 7 B Cat3 2200
# 8 B Cat4 1000
df_with_shares = df %>%
# Calculating group totals and adding them back to the main df
left_join(df %>%
group_by(group) %>%
summarize(group_total = sum(values)),
by='group') %>%
# Calculating each category's share within the groups
mutate(group_share = values / group_total,
group_share_rounded = round(group_share,2))
# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <int>
# 1 A 0.99
# 2 B 1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
df_with_rounded_shares = df %>%
mutate(
percs = values / sum(values),
percs_cumsum = cumsum(percs),
percs_cumsum_round = round(percs_cumsum, 2),
percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round,1),0),
percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like. It just calculates the shares of the values
column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which need their rounded values to add up to 100% independently from every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.
The method devised in your implementation (from here) just needs a few small tweaks to make it work.
First, include a group_by
statement before calculating the new columns. Also, you need to use a reframe
statement instead of the mutate
statement you have now.
In essence, this is what it'll look like:
# Modified version of your implementation of the rounding procedure.
# The new procedure below accommodates for grouping variables.
df_with_rounded_shares_by_group = df %>%
group_by(group) %>%
reframe(
group_share = values / sum(values),
group_share_cumsum = cumsum(group_share),
group_share_cumsum_round = round(group_share_cumsum, 2),
group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
# Removing unnecessary temporary columns
select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
Doing the above will generate a tibble that looks like this:
group group_share group_share_rounded_final
<chr> <dbl> <dbl>
1 A 0.185 0.18
2 A 0.395 0.4
3 A 0.252 0.25
4 A 0.168 0.17
5 B 0.276 0.28
6 B 0.419 0.42
7 B 0.210 0.2
8 B 0.0952 0.1
Then you can verify that the new rounded values really do add up to 100%:
# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded_final))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 1
# 2 B 1
# Yep, they all add up to 100% as expected!
If you look closely, though, there is a small issues: the resulting table has lost the categories
column.
Here is the full code that fixes the issue. I admit that it's a bit of a cumbersome workaround and welcome any suggestions to improve this:
df_with_rounded_shares_by_group = df %>%
mutate(join_col = row_number()) %>%
left_join(
df %>%
group_by(group) %>%
reframe(
group_share = values / sum(values),
group_share_cumsum = cumsum(group_share),
group_share_cumsum_round = round(group_share_cumsum, 3),
group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
# Removing unnecessary temporary columns
ungroup() %>%
select(-group,-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset) %>%
mutate(join_col = row_number()),
by='join_col') %>%
select(-join_col)
print(df_with_rounded_shares_by_group)
> group categories values group_share group_share_rounded_final
> 1 A Cat1 2200 0.1848739 0.185
> 2 A Cat2 4700 0.3949580 0.395
> 3 A Cat3 3000 0.2521008 0.252
> 4 A Cat4 2000 0.1680672 0.168
> 5 B Cat1 2900 0.2761905 0.276
> 6 B Cat2 4400 0.4190476 0.419
> 7 B Cat3 2200 0.2095238 0.210
> 8 B Cat4 1000 0.0952381 0.095
Now all of the necessary columns are in your output and the group_share_rounded_final
column really does add up to 100%.
Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.