Search code examples
rdplyrlaglead

R lead lag function summarize within group and calculate percent


Here is what my data frame looks like:

enter image description here

And here is the dput structure of it.

structure(list(tier_1 = c("Organic Search", "Organic Search", 
"Organic Search", "Organic Search", "Organic Search", "Organic Search", 
"Organic Search", "Organic Search", "Organic Search", "Organic Search", 
"Organic Social", "Organic Social", "Organic Social", "Organic Social", 
"Organic Social", "Organic Social", "Organic Social", "Paid Search", 
"Paid Search", "Paid Search", "Paid Search", "Paid Search", "Paid Search", 
"Paid Search", "Paid Search", "Paid Search", "Paid Social", "Paid Social", 
"Paid Social", "Paid Social", "Paid Social", "Paid Social", "Paid Social", 
"Paid Social", "Paid Social"), sequence_number = c(1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L), count_of_sequence_numbers = c(1176L, 460L, 119L, 41L, 21L, 
5L, 8L, 6L, 2L, 1L, 133L, 52L, 11L, 2L, 2L, 1L, 1L, 7516L, 1090L, 
284L, 90L, 36L, 21L, 12L, 6L, 2L, 1979L, 674L, 99L, 30L, 11L, 
2L, 3L, 2L, 1L), percent = c(0.637744034707158, 0.249457700650759, 
0.0645336225596529, 0.022234273318872, 0.0113882863340564, 0.0027114967462039, 
0.00433839479392625, 0.00325379609544469, 0.00108459869848156, 
0.000542299349240781, 0.655172413793103, 0.25615763546798, 0.0541871921182266, 
0.00985221674876847, 0.00985221674876847, 0.00492610837438424, 
0.00492610837438424, 0.827662151745402, 0.120030833608633, 0.0312740887567449, 
0.00991080277502478, 0.00396432111000991, 0.00231252064750578, 
0.0013214403700033, 0.000660720185001652, 0.000220240061667217, 
0.704019921736037, 0.23977232301672, 0.0352187833511206, 0.0106723585912487, 
0.00391319815012451, 0.000711490572749911, 0.00106723585912487, 
0.000711490572749911, 0.000355745286374956)), row.names = c(NA, 
-35L), groups = structure(list(tier_1 = c("Organic Search", "Organic Social", 
"Paid Search", "Paid Social"), .rows = structure(list(1:10, 11:17, 
    18:26, 27:35), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

df <- df %>% 
  group_by(tier_1, sequence_number) %>%
  summarize(count_of_sequence_numbers = length(sequence_number)) %>%
  mutate(percent = count_of_sequence_numbers / sum(count_of_sequence_numbers)) %>%
  filter(sequence_number <= 10)

I was able to come up with the percent column by using the code above, specifically the part about the count / sum(count).

I do have a problem, though, which is that the percent is incorrect. The values from count_of_sequence_numbers, when referring to sequence_number = 2 should be subtracted from the values from count_of_sequence_numbers when sequence_number = 1 (within the same category). And everything from count_of_sequence_numbers, when referring to sequence_number = 3 should be subtracted from both the count_of_sequence_numbers when sequence_number = 2 and when sequence_number = 3.

What I'm saying is, I really need a count of sequence numbers that, for sequence_number = 1, doesn't include 2-10, and when it's 2, doesn't include 3-10, etc. The 1176 value should really be 1176 - 460 - 119 - 41 - 21 - 5 -8 - 6 -2 -1. And the 460 value should be 460 - 119 - 41 - 21 - 5 -8 - 6 -2 -1. And then the percent should be calculated off that.

I tried a lead function but I just don't think this can be the efficient way. :/ That -1175 number in particular is making me nervous.

df <- df %>%
    group_by(tier_1) %>%
    arrange(tier_1, sequence_number) %>%
    mutate(diff = count_of_sequence_numbers - lead(count_of_sequence_numbers, default = first(count_of_sequence_numbers)))

enter image description here

If I change to lead(count_of_sequence_numbers, default = 0)) I get better behavior but it's still not quite what I'm trying to do, which is to subtract the value by the sum of all the others in the same group that have a greater sequence number.

enter image description here


Solution

  • Is this the output you're looking for?

    df %>%
      arrange(tier_1, -sequence_number) %>%
      group_by(tier_1) %>%   # already grouped this way, only including for clarity
      mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
             diff = count_of_sequence_numbers - cuml) %>%
      ungroup()
    
    
    ## A tibble: 35 x 6
    #   tier_1         sequence_number count_of_sequence_numbers  percent  cuml  diff
    #   <chr>                    <int>                     <int>    <dbl> <dbl> <dbl>
    # 1 Organic Search              10                         1 0.000542     0     1
    # 2 Organic Search               9                         2 0.00108      1     1
    # 3 Organic Search               8                         6 0.00325      3     3
    # 4 Organic Search               7                         8 0.00434      9    -1
    # 5 Organic Search               6                         5 0.00271     17   -12
    # 6 Organic Search               5                        21 0.0114      22    -1
    # 7 Organic Search               4                        41 0.0222      43    -2
    # 8 Organic Search               3                       119 0.0645      84    35
    # 9 Organic Search               2                       460 0.249      203   257
    #10 Organic Search               1                      1176 0.638      663   513
    ## … with 25 more rows