Search code examples
rgroup-byh2osplit-apply-combine

h2o.ai summary functions incompatible with group_by


I am trying to (substantially) accelerate some R code by moving to R+h2o.ai.

I am grouping by a single factor variable but I get error when I try to compute windowed quantiles, skewness, or kurtosis.

Is there a list of summary functions in h2o that are incompatible with the split-apply-combine approach? Does it only apply to sql-analog functions like sum, count, or stdev?

This code fails:

for(i in col_idx_list){
  proc_cols_list <- names(df.hex)[i]

  group_cols_list <- c("group_variable_factor")

  h2o.quantile(x=df.hex[,proc_cols_list])

  temp <- h2o.group_by(data=df.hex,
                       by=group_cols_list,
                       mean(proc_cols_list),
                       var(proc_cols_list),
                       skewness(proc_cols_list),
                       gb.control=list(na.methods="ignore")  )

  if(i ==first_index){
    df_summs <- temp
  } else {
    df_summs <- h2o.cbind(df_summs , temp[,2:ncol(temp)])
  }
}

This code runs fine:

for(i in col_idx_list){
  proc_cols_list <- names(df.hex)[i]

  group_cols_list <- c("group_variable_factor")

  h2o.quantile(x=df.hex[,proc_cols_list])

  temp <- h2o.group_by(data=df.hex,
                       by=group_cols_list,
                       mean(proc_cols_list),
                       var(proc_cols_list),
                       gb.control=list(na.methods="ignore")  )

  if(i ==first_index){
    df_summs <- temp
  } else {
    df_summs <- h2o.cbind(df_summs , temp[,2:ncol(temp)])
  }
}

Error text (truncated for brevity):

ERROR: Unexpected HTTP Status code: 400 Bad Request (url = http://localhost:54321/99/Rapids)
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
ERROR MESSAGE:
No enum constant water.rapids.ast.prims.mungers.AstGroup.FCN.skewness
ERROR: Unexpected HTTP Status code: 404 Not Found (url = http://localhost:54321/3/Frames/RTMP_sid_8712_17?row_count=10)

ERROR MESSAGE:

Object 'RTMP_sid_8712_17' not found for argument: key

Solution

  • The error seems to indicated that skewness is a problem. For a full list of aggregation methods that are allowed in the h2o.group_by() please see the Details section (bottom of the page) of the documentation.

    For your convenience I am adding the Details section here - you can see that skewness is not currently included (if this something you'd be interested in having feel free to create a JIRA ticket):

    Details In the case of na.methods within gb.control, there are three possible settings. "all" will include NAs in computation of functions. "rm" will completely remove all NA fields. "ignore" will remove NAs from the numerator but keep the rows for computational purposes. If a list smaller than the number of columns groups is supplied, the list will be padded by "ignore". Note that to specify a list of column names in the gb.control list, you must add the col.names argument. Similar to na.methods, col.names will pad the list with the default column names if the length is less than the number of colums groups supplied. Supported functions include nrow. This function is required and accepts a string for the name of the generated column. Other supported aggregate functions accept col and na arguments for specifying columns and the handling of NAs ("all", "ignore", and GroupBy object; max calculates the maximum of each column specified in col for each group of a GroupBy object; mean calculates the mean of each column specified in col for each group of a GroupBy object; min calculates the minimum of each column specified in col for each group of a GroupBy object; mode calculates the mode of each column specified in col for each group of a GroupBy object; sd calculates the standard deviation of each column specified in col for each group of a GroupBy object; ss calculates the sum of squares of each column specified in col for each group of a GroupBy object; sum calculates the sum of each column specified in col for each group of a GroupBy object; and var calculates the variance of each column specified in col for each group of a GroupBy object. If an aggregate is provided without a value (for example, as max in sum(col="X1", na="all").mean(col="X5", na="all").max()), then it is assumed that the aggregation should apply to all columns except the GroupBy columns. However, operations will not be performed on String columns. They will be skipped. Note again that nrow is required and cannot be empty.