Search code examples
rdplyrh2o

Aggregating Max using h2o in R


I have started using h2o for aggregating large datasets and I have found peculiar behaviour when trying to aggregate the maximum value using h2o's h2o.group_by function. My dataframe often has variables which comprise some or all NA's for a given grouping. Below is an example dataframe.

df <- data.frame("ID" = 1:16)
df$Group<- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)

   ID Group VarA VarB VarD
1   1     1   NA   NA   10
2   2     1    1   NA   12
3   3     1    2   NA   14
4   4     1    3   NA   16
5   5     2   12   10   10
6   6     2   12   12   12
7   7     2   12   14   14
8   8     3   12   16   16
9   9     3    0   10   10
10 10     3   14   12   12
11 11     4   NA   14   14
12 12     4   14   16   16
13 13     5   16   10   10
14 14     5   16   12   12
15 15     5   NA   14   14
16 16     5   16   16   16

In this dataframe Group == 1 is completely missing data for VarB (but this is important information to know, so the output for aggregating for the maximum should be NA), while for Group == 1 VarA only has one missing value so the maximum should be 3.

This is a link which includes the behaviour of the behaviour of the na.methods argument (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html).

If I set the na.methods = 'all' as below then the aggregated output is NA for Group 1 for both Vars A and B (which is not what I want, but I completely understand this behaviour).

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))

  Group max_ID max_VarA max_VarB max_VarD
1     1      4      NaN      NaN       16
2     2      7       12       14       14
3     3     10       14       16       16
4     4     12      NaN       16       16
5     5     16      NaN       16       16

If I set the na.methods = 'rm' as below then the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense) but for VarB is -1.80e308 (which is not what I want, and I do not understand this behaviour).

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))

  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12  1.4 e  1       14
3     3     10       14  1.6 e  1       16
4     4     12       14  1.6 e  1       16
5     5     16       16  1.6 e  1       16

Similarly I get the same output if set the na.methods = 'ignore'.

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))

  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12  1.4 e  1       14
3     3     10       14  1.6 e  1       16
4     4     12       14  1.6 e  1       16
5     5     16       16  1.6 e  1       16

I am not sure why something as common as completely missing data for a given variable within a specific group is being given a value of -1.80e308? I tried the same workflow in dplyr and got results which match my expectations (but this is not a solution as I cannot process datasets of this size in dplyr, and hence my need for a solution in h2o). I realise dplyr is giving me -inf values rather than NA, and I can easily recode both -1.80e308 and -Inf to NA, but I am trying to make sure that this isn't a symptom of a larger problem in h2o (or that I am not doing something fundamentally wrong in my code when attempting to aggregate in h2o). I also have to aggregate normalised datasets which often have values which are approximately similar to -1.80e308, so I do not want to accidentally recode legitimate values to NA.

library(dplyr)
df %>%
  group_by(Group) %>% 
  summarise(across(everything(), ~max(.x, na.rm = TRUE)))

  Group    ID  VarA  VarB  VarD
  <dbl> <int> <dbl> <dbl> <dbl>
1     1     4     3  -Inf    16
2     2     7    12    14    14
3     3    10    14    16    16
4     4    12    14    16    16
5     5    16    16    16    16

 

Solution

  • This is happening because H2O considers value -Double.MAX_VALUE to be the lowest possible representable floating-point number. This value corresponds to -1.80e308. I agree this is confusing and I would consider this to be a bug. You can file an issue in our bug tracker: https://h2oai.atlassian.net/ (PUBDEV project)