Tags: r, aggregate, aggregation, large-data, bigdata

Aggregating big data in R


I have a dataset (dat) that looks like this:

Team    Person      Performance1    Performance2
 1      36465930         1              101
 1      37236856         1              101
 1      34940210         1              101
 1      29135524         1              101
 2      10318268         1              541
 2      641793           1              541
 2      32352593         1              541
 2      2139024          1              541
 3      35193922         2              790
 3      32645504         2              890
 3      32304024         2              790
 3      22696491         2              790

I am trying to identify and remove all teams that have variance on Performance1 or Performance2. For example, team 3 above has variance on Performance2, so I would want to remove that team from the dataset. Here is the code as I've written it:

tda <- aggregate(dat, by=list(dat$Team), FUN=sd)
tda1 <- tda[ which(tda$Performance1 != 0 | tda$Performance2 != 0), ]

The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?

Thanks in advance! :)

Sincerely, Amy


Solution

  • The dplyr package is generally very fast. Here's a way to select only those teams whose standard deviation is zero for both Performance1 and Performance2, i.e., the teams you want to keep:

    library(dplyr)
    
    datAggregated <- dat %>%
      group_by(Team) %>%                       # one summary row per team
      summarise(sdP1 = sd(Performance1),       # within-team spread of Performance1
                sdP2 = sd(Performance2)) %>%   # within-team spread of Performance2
      filter(sdP1 == 0 & sdP2 == 0)            # keep only teams with no variance
    
    datAggregated
      Team sdP1 sdP2
    1    1    0    0
    2    2    0    0
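
  • The summary above lists the teams to keep, while the question asks to remove the varying teams from the dataset itself. A minimal follow-up sketch, assuming the datAggregated result from above (datClean is a name introduced here for illustration): a semi_join keeps only the rows of dat whose Team appears in datAggregated.

    library(dplyr)
    
    # Keep only rows whose Team passed the zero-variance filter above.
    datClean <- dat %>%
      semi_join(datAggregated, by = "Team")

    One caveat: sd() returns NA for a team with a single member, so such teams would fail the sdP1 == 0 test and be dropped as well.

  • If dplyr is still too slow at this scale, data.table is another commonly used option. A hedged alternative sketch that avoids computing standard deviations entirely by checking whether each performance column is constant within the team (uniqueN counts distinct values):

    library(data.table)
    
    dt <- as.data.table(dat)
    # Return a team's rows only when both performance columns are constant;
    # groups where the condition is FALSE return NULL and are dropped.
    datClean <- dt[, if (uniqueN(Performance1) == 1L && uniqueN(Performance2) == 1L) .SD,
                   by = Team]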