Here is what my data frame looks like:
And here is the dput structure of it.
structure(list(tier_1 = c("Organic Search", "Organic Search",
"Organic Search", "Organic Search", "Organic Search", "Organic Search",
"Organic Search", "Organic Search", "Organic Search", "Organic Search",
"Organic Social", "Organic Social", "Organic Social", "Organic Social",
"Organic Social", "Organic Social", "Organic Social", "Paid Search",
"Paid Search", "Paid Search", "Paid Search", "Paid Search", "Paid Search",
"Paid Search", "Paid Search", "Paid Search", "Paid Social", "Paid Social",
"Paid Social", "Paid Social", "Paid Social", "Paid Social", "Paid Social",
"Paid Social", "Paid Social"), sequence_number = c(1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), count_of_sequence_numbers = c(1176L, 460L, 119L, 41L, 21L,
5L, 8L, 6L, 2L, 1L, 133L, 52L, 11L, 2L, 2L, 1L, 1L, 7516L, 1090L,
284L, 90L, 36L, 21L, 12L, 6L, 2L, 1979L, 674L, 99L, 30L, 11L,
2L, 3L, 2L, 1L), percent = c(0.637744034707158, 0.249457700650759,
0.0645336225596529, 0.022234273318872, 0.0113882863340564, 0.0027114967462039,
0.00433839479392625, 0.00325379609544469, 0.00108459869848156,
0.000542299349240781, 0.655172413793103, 0.25615763546798, 0.0541871921182266,
0.00985221674876847, 0.00985221674876847, 0.00492610837438424,
0.00492610837438424, 0.827662151745402, 0.120030833608633, 0.0312740887567449,
0.00991080277502478, 0.00396432111000991, 0.00231252064750578,
0.0013214403700033, 0.000660720185001652, 0.000220240061667217,
0.704019921736037, 0.23977232301672, 0.0352187833511206, 0.0106723585912487,
0.00391319815012451, 0.000711490572749911, 0.00106723585912487,
0.000711490572749911, 0.000355745286374956)), row.names = c(NA,
-35L), groups = structure(list(tier_1 = c("Organic Search", "Organic Social",
"Paid Search", "Paid Social"), .rows = structure(list(1:10, 11:17,
18:26, 27:35), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
df <- df %>%
group_by(tier_1, sequence_number) %>%
summarize(count_of_sequence_numbers = length(sequence_number)) %>%
mutate(percent = count_of_sequence_numbers / sum(count_of_sequence_numbers)) %>%
filter(sequence_number <= 10)
I was able to come up with the percent column by using the code above, specifically the part about the count / sum(count).
I do have a problem, though, which is that the percent is incorrect. The values from count_of_sequence_numbers, when referring to sequence_number = 2 should be subtracted from the values from count_of_sequence_numbers when sequence_number = 1 (within the same category). And everything from count_of_sequence_numbers, when referring to sequence_number = 3 should be subtracted from both the count_of_sequence_numbers when sequence_number = 2 and when sequence_number = 3.
What I'm saying is, I really need a count of sequence numbers that, for sequence_number = 1, doesn't include 2-10, and when it's 2, doesn't include 3-10, etc. The 1176 value should really be 1176 - 460 - 119 - 41 - 21 - 5 -8 - 6 -2 -1. And the 460 value should be 460 - 119 - 41 - 21 - 5 -8 - 6 -2 -1. And then the percent should be calculated off that.
I tried a lead function but I just don't think this can be the efficient way. :/ That -1175 number in particular is making me nervous.
df <- df %>%
group_by(tier_1) %>%
arrange(tier_1, sequence_number) %>%
mutate(diff = count_of_sequence_numbers - lead(count_of_sequence_numbers, default = first(count_of_sequence_numbers)))
If I change to lead(count_of_sequence_numbers, default = 0)) I get better behavior but it's still not quite what I'm trying to do, which is to subtract the value by the sum of all the others in the same group that have a greater sequence number.
Is this the output you're looking for?
df %>%
arrange(tier_1, -sequence_number) %>%
group_by(tier_1) %>% # already grouped this way, only including for clarity
mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
diff = count_of_sequence_numbers - cuml) %>%
ungroup()
## A tibble: 35 x 6
# tier_1 sequence_number count_of_sequence_numbers percent cuml diff
# <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 Organic Search 10 1 0.000542 0 1
# 2 Organic Search 9 2 0.00108 1 1
# 3 Organic Search 8 6 0.00325 3 3
# 4 Organic Search 7 8 0.00434 9 -1
# 5 Organic Search 6 5 0.00271 17 -12
# 6 Organic Search 5 21 0.0114 22 -1
# 7 Organic Search 4 41 0.0222 43 -2
# 8 Organic Search 3 119 0.0645 84 35
# 9 Organic Search 2 460 0.249 203 257
#10 Organic Search 1 1176 0.638 663 513
## … with 25 more rows