I have a table where each row is a vendor with a sale made on some date.
I'm trying to compute average daily sales per vendor for the year 2019, and get a single number. Which I think means I want to compute an average of averages.
This is the query I'm considering, but it takes a very long time on this large table. Is there a smarter way to compute this average without this much nesting? I have a feeling I'm scanning rows more times than I need to.
-- Average of all vendor's average daily sale counts
SELECT AVG(vendor_avgs.avg_daily_sales) avg_of_avgs
FROM (
-- Get average number of daily sales for each vendor
SELECT vendor_daily_totals.memberdeviceid, AVG(vendor_daily_totals.cnt)
avg_daily_sales
FROM (
-- Get total number of sales for each vendor
SELECT vendorid, COUNT(*) cnt
FROM vendor_sales
WHERE year = 2019
GROUP BY vendorid, month, day
) vendor_daily_totals
GROUP BY vendor_daily_totals.vendorid
) vendor_avgs;
I'm curious if there is in general a way to compute an average of averages more efficiently.
This is running in Impala, by the way.
I think you can just do the calculation in one shot:
SELECT AVG(t.avgs)
FROM (
SELECT vendorid,
COUNT(*) * 1.0 / COUNT(DISTINCT month, day) as avgs
FROM vendor_sales
WHERE year = 2019
GROUP BY vendorid
) t
This gets the total and divides by the number of days. However, COUNT(DISTINCT)
might be even slower than nested GROUP BY
s in Impala, so you need to test this.