Search code examples
rdbplyr

Avoid arithmetic overflow error when using dbplyr


This question is similar to this one, but in a summarise so the posted answer doesn't quite fit. The data are such that a row identifies a unit-time pair:

large_sql_df
id   t   var1   var2
1    1    10     0
1    2    20     1
2    1    11     0

And I would like to aggregate by var2 and time t:

localdf <- large_sql_df %>%
     group_by(var, t) %>%
     summarise(count = n(), var1_mean = mean(var1))

This gives the error: "Arithmetic overflow error converting expression to data type int." I think this is because count becomes a very large number. Is there a way to stop this from happening without having to do the entire query in SQL?


Solution

  • The arithmetic overflow limit is 2,147,483,647 (see here) so this is probably not caused by count. It is much more likely caused by mean(var1) because to calculate the mean the computer first has to calculate sum(var1). If your var1 column often contains values greater than 40 then an overflow error is likely (40 * 350 million > 2,147,483,647).

    The solution from here is relevant for your question too: Explicitly transform var1 into a more suitable format. Something like the following:

    localdf <- large_sql_df %>%
         mutate(var1 = CAST(var1 AS FLOAT)) %>% 
         group_by(var, t) %>%
         summarise(count = n(), var1_mean = mean(var1))
    

    Depending on your version of dbplyr you might need something like mutate(var1 = sql("CAST(var1 AS FLOAT)")).

    See this question for a discussion of different data types. While float is an inexact data type it can handle numbers up to 10E38. If you are taking the average over such a large total then the impression introduced is unlikely to be significant.