Tags: r, aggregate, aggregation, large-data, bigdata

Aggregating big data in R


I have a dataset (dat) that looks like this:

Team    Person      Performance1    Performance2
 1      36465930         1              101
 1      37236856         1              101
 1      34940210         1              101
 1      29135524         1              101
 2      10318268         1              541
 2      641793           1              541
 2      32352593         1              541
 2      2139024          1              541
 3      35193922         2              790
 3      32645504         2              890
 3      32304024         2              790
 3      22696491         2              790

I am trying to identify and remove all teams that have variance on Performance1 or Performance2. For example, team 3 above has variance on Performance2, so I would want to remove that team from the dataset. Here is the code as I've written it:

tda <- aggregate(dat, by=list(dat$Team), FUN=sd)
tda1 <- tda[ which(tda$Performance1 != 0 | tda$Performance2 != 0), ]

The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?

Thanks in advance! :)

Sincerely, Amy


Solution

  • The dplyr package is generally very fast. Here's a way to select only those teams whose standard deviation is zero for both Performance1 and Performance2, i.e., the teams you want to keep:

    library(dplyr)
    
    datAggregated <- dat %>%
      group_by(Team) %>%                       # one summary row per team
      summarise(sdP1 = sd(Performance1),       # within-team spread of Performance1
                sdP2 = sd(Performance2)) %>%   # within-team spread of Performance2
      filter(sdP1 == 0 & sdP2 == 0)            # keep only teams with no variance
    
    datAggregated
      Team sdP1 sdP2
    1    1    0    0
    2    2    0    0
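
  • The summary above lists the teams to keep, while the question asks to remove the varying teams from the dataset itself. A minimal follow-up sketch, assuming the datAggregated result from above (datClean is a name introduced here for illustration): a semi_join keeps only the rows of dat whose Team appears in datAggregated.

    library(dplyr)
    
    # Keep only rows whose Team passed the zero-variance filter above.
    datClean <- dat %>%
      semi_join(datAggregated, by = "Team")

    One caveat: sd() returns NA for a team with a single member, so such teams would fail the sdP1 == 0 test and be dropped as well.

  • If dplyr is still too slow at this scale, data.table is another commonly used option. A hedged alternative sketch that avoids computing standard deviations entirely by checking whether each performance column is constant within the team (uniqueN counts distinct values):

    library(data.table)
    
    dt <- as.data.table(dat)
    # Return a team's rows only when both performance columns are constant;
    # groups where the condition is FALSE return NULL and are dropped.
    datClean <- dt[, if (uniqueN(Performance1) == 1L && uniqueN(Performance2) == 1L) .SD,
                   by = Team]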