r, data.table, disk.frame

My group by doesn't appear to be working in disk frames


I ran a group-by on a large dataset (>20 GB), and it doesn't appear to be working quite right.

This is my code

mydf[, .(value = n_distinct(list_of_id, na.rm = T)),
                      by = .(week),
                      keep = c("list_of_id", "week")
                      ] 

It returned these warnings

Warning messages:
1: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
2: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
3: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
4: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
5: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
6: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
7: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
8: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading

I had initially loaded the library but then I ran remove.packages(MLmetrics) before running this code. Additionally, I checked conflicted::conflict_scout and there aren't any conflicts that show up with the package MLmetrics.

When I run this code

> mydf %>% 
+   filter(week == "2012-01-02")

It gives me this output

         week value
1: 2012-01-02   483
2: 2012-01-02 61233

I'm concerned that something went wrong during the grouping, since the result doesn't contain a single row per distinct value of week. Both columns are stored as character.


Solution

  • Author of {disk.frame} here.

    The issue is that currently, with the data.table syntax, {disk.frame} does the group-by only within each chunk. It does not do the group-by globally the way the dplyr syntax does, which is why the same week can show up once per chunk.

    So you would have to summarise the chunk-level results again to get what you want. I suggest sticking with the dplyr syntax for now.

    As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine. Support for the data.table syntax is currently lacking, so for now you can only achieve what you want with the dplyr syntax.

    {disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.

    Please DM me if you or your organization would like to fund the development of this feature.
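
To illustrate the chunk-wise behaviour described above, here is a minimal sketch in plain data.table. It does not use {disk.frame} at all: the chunks are simulated with a hypothetical `chunk` column on a made-up toy table, so treat it as an illustration of the idea rather than of the package's internals. It shows why a week can appear more than once after a per-chunk group-by, and why a distinct count has to be redone on the combined (week, id) pairs rather than by adding up the per-chunk counts.

```r
library(data.table)

# Toy data (made up for illustration). The `chunk` column is a hypothetical
# stand-in for the chunks a disk.frame would be split into.
toy <- data.table(
  week       = c("2012-01-02", "2012-01-02", "2012-01-02", "2012-01-09"),
  list_of_id = c("a", "b", "a", "c"),
  chunk      = c(1L, 1L, 2L, 2L)
)

# Per-chunk group-by: each chunk is summarised on its own, so the same week
# shows up once for every chunk that contains it.
per_chunk <- toy[, .(value = uniqueN(list_of_id)), by = .(chunk, week)]

# Stage 1: within each chunk, reduce the data to unique (week, id) pairs.
stage1 <- unique(toy, by = c("chunk", "week", "list_of_id"))

# Stage 2: combine the chunk-level results and count distinct ids per week.
# Summing the per-chunk counts instead would double count id "a".
global <- stage1[, .(value = uniqueN(list_of_id)), by = week]

print(per_chunk)
print(global)
```

In this toy example the week 2012-01-02 appears twice in `per_chunk`, which has the same shape as the output in the question, and only once in `global`.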