Say I have a huge source XDF file generated with RevoScaleR. I want to create a new target XDF file by grouping the source entries on columns A, B, and C and computing the sum, min, max, mean, and standard deviation of column D.
Assume the target data is also too big to fit into memory. How should I proceed? I could not find much information about group-by operations in the documentation.
The dplyrXdf package lets you carry out dplyr operations like this on Xdf files.
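library(dplyrXdf)
# Reference the source Xdf file on disk
src <- RxXdfData("src.xdf")
# Group on A, B, C and compute the summary statistics of D per group
dest <- src %>%
  group_by(A, B, C) %>%
  summarise(sum=sum(D), min=min(D), max=max(D), mean=mean(D), sd=sd(D))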
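To check what this pipeline computes before running it on the full file, the same verbs can be tried with plain dplyr on a small in-memory data frame. This is only a sketch: the `toy` data frame and its values are made up for illustration, and it assumes the dplyrXdf verbs give the same per-group results as their dplyr counterparts.

```r
library(dplyr)

# Hypothetical toy data standing in for the contents of src.xdf
toy <- data.frame(
  A = c("a", "a", "a", "b"),
  B = c(1, 1, 1, 2),
  C = c("x", "x", "x", "y"),
  D = c(2, 4, 6, 10)
)

# Same grouping and summary statistics as in the Xdf pipeline
res <- toy %>%
  group_by(A, B, C) %>%
  summarise(sum = sum(D), min = min(D), max = max(D),
            mean = mean(D), sd = sd(D))

# One row per distinct (A, B, C) combination
print(res)
```

Note that sd() returns NA for a group with a single row, so the (b, 2, y) group here has a missing standard deviation. For the real data, the result of the Xdf pipeline is itself an Xdf-backed object rather than an in-memory data frame; I believe dplyrXdf has a persist() function for saving it to a permanent file, but check the package documentation for the exact arguments.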