Search code examples
rdplyrequivalent

Is there a way to write this in a single Dplyr statement / more efficiently?


My (simplified) dataset consists of donor occupation and contribution amounts. I'm trying to determine what the average contribution amount by occupation is (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements -- one to get a sum of contributions amount by each occupation and another to get a count of the number of donations from that specific occupation. I am then binding the dataframes with cbind and creating a new column with mutate, where I can divide the sum by the count.

Data example:

contributor_occupation contribution_receipt_amount
1                    LISTING COORDINATOR                        5.00
2                           NOT EMPLOYED                        2.70
3                                TEACHER                        2.70
4                    ELECTRICAL DESIGNER                        2.00
5                                STUDENT                       50.00
6                      SOFTWARE ENGINEER                       10.00
7                           TRUCK DRIVER                        2.70
8                           NOT EMPLOYED                       50.00
9                             CONTRACTOR                        5.00
10                              ENGINEER                        6.00
11                                FARMER                        2.70
12                                ARTIST                       50.00
13                         CIRCUS ARTIST                      100.00
14                         CIRCUS ARTIST                       27.00
15          INFORMATION SECURITY ANALYST                        2.00
16                                LAWYER                        5.00
occupation2 <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarise(total = sum(contribution_receipt_amount)) %>%
  arrange(desc(contributor_occupation))


occupation3 <- b %>%
  select(contributor_occupation) %>%
  count(contributor_occupation) %>%
  group_by(contributor_occupation) %>%
  arrange(desc(contributor_occupation))

final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column 

occ_avg <- final_occ %>% 
  select(contributor_occupation:n) %>%
  mutate("Average Donation" = total/n) %>%
  rename("Number of Donations"= n, "Occupation" = contributor_occupation, "Total Donated" = total)

occ_avg %>%
  arrange(desc(`Average Donation`))

This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:

avg_donation_occupation <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
  arrange(desc(avg_donation_by_occupation))

Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?

Thanks for the help clearing up any confusion!


Solution

  • We may need both sum and mean along with n() which gives the number of observations in the grouped data. According to ?context

    n() gives the current group size.

    and `?mean

    mean - Generic function for the (trimmed) arithmetic mean.

    which is basically the sum of observations divided by the number of observations

    library(dplyr)
    out <- b %>% 
      group_by(Occupation = contributor_occupation) %>% 
      summarise(`Total Donated` = sum(contribution_receipt_amount), 
          `Number of Donations` = n(),
       `Average Donation` = mean(contribution_receipt_amount),
       #or
       #`Average Donation` = `Total Donated`/`Number of Donations`,    
          .groups = 'drop') %>% 
       
      arrange(desc(`Average Donation`))
    

    -output

    out
    # A tibble: 14 × 4
       Occupation                   `Total Donated` `Number of Donations` `Average Donation`
       <chr>                                  <dbl>                 <int>              <dbl>
     1 CIRCUS ARTIST                          127                       2               63.5
     2 ARTIST                                  50                       1               50  
     3 STUDENT                                 50                       1               50  
     4 NOT EMPLOYED                            52.7                     2               26.4
     5 SOFTWARE ENGINEER                       10                       1               10  
     6 ENGINEER                                 6                       1                6  
     7 CONTRACTOR                               5                       1                5  
     8 LAWYER                                   5                       1                5  
     9 LISTING COORDINATOR                      5                       1                5  
    10 FARMER                                   2.7                     1                2.7
    11 TEACHER                                  2.7                     1                2.7
    12 TRUCK DRIVER                             2.7                     1                2.7
    13 ELECTRICAL DESIGNER                      2                       1                2  
    14 INFORMATION SECURITY ANALYST             2                       1                2  
    

    data

    b <- structure(list(contributor_occupation = c("LISTING COORDINATOR", 
    "NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT", 
    "SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR", 
    "ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST", 
    "INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5, 
    2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)), 
    class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
    "14", "15", "16"))