Search code examples
sqlhadoopquery-optimizationimpala

Optimize Average of Averages SQL Query


I have a table where each row is a vendor with a sale made on some date.

I'm trying to compute average daily sales per vendor for the year 2019, and get a single number. Which I think means I want to compute an average of averages.

This is the query I'm considering, but it takes a very long time on this large table. Is there a smarter way to compute this average without this much nesting? I have a feeling I'm scanning rows more times than I need to.

-- Average of all vendor's average daily sale counts
SELECT AVG(vendor_avgs.avg_daily_sales) avg_of_avgs
FROM (
        -- Get average number of daily sales for each vendor
        SELECT vendor_daily_totals.memberdeviceid, AVG(vendor_daily_totals.cnt) 
avg_daily_sales
        FROM (
                -- Get total number of sales for each vendor
                SELECT vendorid, COUNT(*) cnt
                FROM vendor_sales
                WHERE year = 2019
                GROUP BY vendorid, month, day
        ) vendor_daily_totals
        GROUP BY vendor_daily_totals.vendorid
) vendor_avgs;

I'm curious if there is in general a way to compute an average of averages more efficiently.

This is running in Impala, by the way.


Solution

  • I think you can just do the calculation in one shot:

    SELECT AVG(t.avgs)
    FROM (
        SELECT vendorid,
               COUNT(*) * 1.0 / COUNT(DISTINCT month, day) as avgs
        FROM vendor_sales
        WHERE year = 2019
        GROUP BY vendorid
    ) t
    

    This gets the total and divides by the number of days. However, COUNT(DISTINCT) might be even slower than nested GROUP BYs in Impala, so you need to test this.