Search code examples
sqlmathrdbms

How does the Average function work in relational databases?


I'm trying to find geometric average of values from a table with millions of rows. For those that don't know, to find the geometric average, you mulitply each value times each other then divide by the number of rows.

You probably already see the problem; The number multiplied number will quickly exceed the maximum allowed system maximum. I found a great solution that uses the natural log.

http://timothychenallen.blogspot.com/2006/03/sql-calculating-geometric-mean-geomean.html

However that got me to wonder wouldn't the same problem apply with the arithmetic mean? If you have N records, and N is very large the running sum can also exceed the system maximum.

So how do RDMS calculate averages during queries?


Solution

  • Very easy to check. For example, SQL Server 2008.

    DECLARE @T TABLE(i int);
    
    INSERT INTO @T(i) VALUES
    (2147483647),
    (2147483647);
    
    SELECT AVG(i) FROM @T;
    

    result

    (2 row(s) affected)
    Msg 8115, Level 16, State 2, Line 7
    Arithmetic overflow error converting expression to data type int.
    

    There is no magic. Column type is int, server adds values together using internal variable of the same type int and intermediary result exceeds range for int.

    You can run the similar check for any other DBMS that you use. Different engines may behave differently, but I would expect all of them to stick to the original type of the column. For example, averaging two int values 100 and 101 may result in 100 or 101 (still int), but never 100.5.

    For SQL Server this behavior is documented. I would expect something similar for all other engines:

    AVG () computes the average of a set of values by dividing the sum of those values by the count of nonnull values. If the sum exceeds the maximum value for the data type of the return value an error will be returned.

    So, you have to be careful when calculating simple average as well, not just product.


    Here is extract from SQL 92 Standard:

    6) Let DT be the data type of the < value expression >.

    9) If SUM or AVG is specified, then:

    a) DT shall not be character string, bit string, or datetime.

    b) If SUM is specified and DT is exact numeric with scale S, then the data type of the result is exact numeric with implementation-defined precision and scale S.

    c) If AVG is specified and DT is exact numeric, then the data type of the result is exact numeric with implementation- defined precision not less than the precision of DT and implementation-defined scale not less than the scale of DT.

    d) If DT is approximate numeric, then the data type of the result is approximate numeric with implementation-defined precision not less than the precision of DT.

    e) If DT is interval, then the data type of the result is inter- val with the same precision as DT.

    So, DBMS can convert int to larger type when calculating AVG, but it has to be an exact numeric type, not floating-point. In any case, depending on the values you can still get arithmetic overflow.