Search code examples
rdplyrgroup-bysummarize

More efficient way of using group_by > mutate > slice


I have a dataframe that looks like this

df <- data.frame("Month" = c("April","April","May","May","June","June","June"),
"ID" = c(11, 11, 12, 10, 11, 11, 11),
"Region" = c("East", "West", "North", "East", "North" ,"East", "West"),
"Qty" = c(120, 110, 110, 110, 100, 90, 70),
"Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
"Leads" = c(10, 12, 9, 8, 6, 5, 4))

Month   ID     Region    Qty    Sales   Leads
April   11     East      120    1000    10
April   11     West      110    1100    12
May     12     North     110    900     9
May     10     East      110    1000    8
June    11     North     100    1000    6
June    11     East      90     800     5
June    11     West      70     650     4

I want a dataframe that looks like this

Month   ID     Qty     Sales   Leads   Region
April   11     230     2100    22      East
May     12     110     900     9       North
May     10     110     1000    8       East
June    11     260     2450    15      North

I am using a the following code

result <- df %>% group_by(Month, ID) %>% mutate(across(.cols = Qty:Leads, ~sum(.x, na.rm = T))) %>% slice(n = 1) 

result$Region <- NULL

I have over 2 million such rows and it is taking forever to calculate the aggregate.

I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.

However I think there could be a more efficient way. Please help on both. Can't figure it out for the life of me.


Solution

  • summarize makes more sense to me than mutate and slice. This should save you some time.

    library(dplyr)
    result <- df %>%
      group_by(Month, ID) %>%
      summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
                Region = first(Region))
    result
    # # A tibble: 4 x 6
    # # Groups:   Month [3]
    #   Month    ID   Qty Sales Leads Region
    #   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
    # 1 April    11   230  2100    22 East  
    # 2 June     11   260  2450    15 North 
    # 3 May      10   110  1000     8 East  
    # 4 May      12   110   900     9 North 
    

    Here is a data.table solution.

    library(data.table)
    
    setDT(df)
    
    cols <- c("Qty", "Sales", "Leads")
    
    df[, c(lapply(.SD, sum, na.rm = TRUE),
           Region = first(Region)), .SDcols = cols, 
       by = .(Month, ID)][]
    #    Month ID Qty Sales Leads Region
    # 1: April 11 230  2100    22   East
    # 2:   May 12 110   900     9  North
    # 3:   May 10 110  1000     8   East
    # 4:  June 11 260  2450    15  North