I ran a group-by on a large dataset (>20GB) and it doesn't appear to be working quite right.
This is my code:
mydf[, .(value = n_distinct(list_of_id, na.rm = T)),
by = .(week),
keep = c("list_of_id", "week")
]
It returned these warnings:
Warning messages:
1: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
2: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
3: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
4: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
5: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
6: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
7: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
8: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
I had initially loaded the library but then I ran remove.packages(MLmetrics) before running this code. Additionally, I checked conflicted::conflict_scout and there aren't any conflicts that show up with the package MLmetrics.
When I run this code:
> mydf %>%
+ filter(week == "2012-01-02")
It gives me this output:
         week value
1: 2012-01-02   483
2: 2012-01-02 61233
I'm concerned that something went wrong during grouping, since the result contains more than one row for the same value of week instead of one group per week. Both columns are stored as character.
Author of {disk.frame} here.
The issue is that currently, with data.table syntax, {disk.frame} does the group-by within
each chunk only. It does not do the group-by globally the way the dplyr syntax does.
So you would have to summarise the per-chunk results again to get what you want. I suggest sticking with the dplyr syntax for now.
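To illustrate the per-chunk behaviour described above, here is a minimal sketch that uses plain data.table only, not {disk.frame} itself. The list `chunks` and the toy values are hypothetical stand-ins for the chunks of a disk.frame. It shows why one week can appear on several rows, and why the second-stage summary for a distinct count should combine the unique (week, id) pairs rather than add up the per-chunk counts.

```r
library(data.table)

# Hypothetical stand-in for a chunked dataset: each list element is one chunk.
chunks <- list(
  data.table(week       = c("2012-01-02", "2012-01-02", "2012-01-09"),
             list_of_id = c("a", "b", "a")),
  data.table(week       = c("2012-01-02", "2012-01-02", "2012-01-09"),
             list_of_id = c("b", "c", NA))
)

# Stage 1 as a per-chunk group-by: every chunk contributes its own row per week,
# so the same week shows up once for each chunk that contains it.
per_chunk <- rbindlist(lapply(chunks, function(dt) {
  dt[, .(value = uniqueN(list_of_id, na.rm = TRUE)), by = week]
}))

# Adding the per-chunk counts double-counts ids that occur in several chunks.
naive <- per_chunk[, .(value = sum(value)), by = week]

# Instead, reduce each chunk to its unique (week, id) pairs, combine the pairs,
# and only then count distinct ids per week.
pairs <- unique(rbindlist(lapply(chunks, function(dt) {
  unique(dt[!is.na(list_of_id), .(week, list_of_id)])
})))
global <- pairs[, .(value = uniqueN(list_of_id)), by = week]

print(per_chunk)  # two rows for 2012-01-02, one per chunk
print(naive)      # 2012-01-02 counted as 4, because id "b" is counted twice
print(global)     # 2012-01-02 counted as 3 (ids a, b, c)
```

The same idea applies to any statistic that is not additive across chunks: carry enough information out of each chunk (here, the unique pairs) for the final summary to be exact.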
As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine, but support for data.table syntax is currently lacking, so for now you can only achieve what you want with the dplyr syntax.
{disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.
Please DM me if anyone/organization would like to fund the development of this feature.