Search code examples
rdplyrcumsum

how best to calculate this share of a total


Below is the sample data. The goal is to first create a column that contains the total employment for that quarter. Second is to create a new column that shows the relative share for the area. Finally, the last item (and one which is vexing me) is to calculate whether the total with suppress = 0 represents over 50% of the total. I can do this in excel easily but trying to this in R and so have it be something that I can replicate year after year.

desired result is below

  area <- c("001","005","007","009","011","013","015","017","019","021","023","027","033","001","005","007","009","011","013","015","017","019","021","023","027","033")
 year <- c("2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021")
 qtr <- c("01","01","01","01","01","01","01","01","01","01","01","01","01","02","02","02","02","02","02","02","02","02","02","02","02","02")
  employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
  suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)

  testitem <- data.frame(year,qtr, area, employment, suppress)

For the first quarter of 2021, the total is 173. If you only take suppress = 1 into account, that is only 24 of 173 hence the TRUE in the 50 percent column. If these two values summed up to 173/2 or greater than you would have it say FALSE. For the second quarter, the suppress = 1 accounts for 310 of the total of 537 and so is over 50% of the total.

For the total column, I am showing the computation or ingredients. Ideally, it would show a value such as .0115 in place of 2/173.

 year    qtr   area     employment   suppress     total       50percent
2021     01     001        2           0          =2/173       TRUE
2021     01     005        4           0          =4/173       TRUE
.....
2021     02     001        3           0          =3/537       FALSE
2021     02     005        5           0          =5/537       FALSE

Solution

  • For example:

    library(dplyr)
    
    testitem %>% 
      group_by(year, qtr) %>% 
      mutate(
        total = employment / sum(employment),
        over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
      )
    

    Gives:

    # A tibble: 26 × 7
    # Groups:   year, qtr [2]
       year  qtr   area  employment suppress  total over_half
       <chr> <chr> <chr>      <dbl>    <dbl>  <dbl> <lgl>    
     1 2021  01    001            2        0 0.0116 TRUE     
     2 2021  01    005            4        0 0.0231 TRUE     
     3 2021  01    007            6        0 0.0347 TRUE     
     4 2021  01    009            8        1 0.0462 TRUE     
     5 2021  01    011           11        0 0.0636 TRUE     
     6 2021  01    013           10        0 0.0578 TRUE     
     7 2021  01    015           12        0 0.0694 TRUE     
     8 2021  01    017           14        0 0.0809 TRUE     
     9 2021  01    019           16        1 0.0925 TRUE     
    10 2021  01    021           18        0 0.104  TRUE     
    # … with 16 more rows
    # ℹ Use `print(n = ...)` to see more rows