Search code examples
sqlt-sqlimpala

Getting the starting point of a cumulative Percent of Total using SQL


I need to return records based on what their "Percent of Total" is. So, for example, I might want to know the "Top 10% of Sales by Customer".

Currently, I'm using this:

SELECT * FROM (SELECT Dim1, SUM(SQ_Fact1) AS Fact1,
(SUM(SUM(SQ_Fact1)) OVER(ORDER BY SUM(SQ_Fact1) DESC))/SUM(SUM(SQ_Fact1)) OVER() AS PctOfTtl
FROM (SELECT Customer AS Dim1, SUM(Sales) AS SQ_Fact1
FROM SalesHistory GROUP BY Customer) AS sq
GROUP BY Dim1) AS mq
WHERE PctOfTtl<=0.10

This is a little overly complicated, but it's part of a query-building UI, so it creates a subquery and "main" outer query for all queries (for reasons unrelated to this), then it has to use another outer query around those to check the PctOfTtl. Regardless, the line that gets the cumulative percent of total (and sorts it descending, so I can get the "top" records) is:

(SUM(SUM(SQ_Fact1)) OVER(ORDER BY SUM(SQ_Fact1) DESC))/SUM(SUM(SQ_Fact1)) OVER() AS PctOfTtl

And then in the outer query, I limit what Percents of Total I return by saying "WHERE PctOfTtl<=0.10"

The problem is, the PctOfTtl tells me where their percent of total ends, not where it starts.

So, lets say the top Customer makes up 15% of the Sales. It will have a PctOfTtl of 0.15. The problem is, if the user asks for Customers in the Top 10% of Sales (<=0.1), it will return no records, because the top-most Customer doesn't "end" until 0.15, so their PctOfTtl won't be <=0.1. What I need to know is where their percent of total starts (i.e. the top-most customer will start at 0, and the second will start at 0.15, etc). What I'm really looking for is any customers whose percent of total "range" falls anywhere IN the percent that the user specifies (even if it extends beyond it).

The problem is, like I said, this is for a query-building UI, so I can't completely restructure the query in order to do this. It has to work within the confines of the existing "subquery/main query" format, which using the "OVER" clause does. I can just check if they're employing a "Top %" filter, and if so, tack on the PctOfTtl column and surround the query with an outer query that limits the PctOfTtl. Basically, what I'm hoping to find is a way to use OVER() that will tell me what the PctOfTtl is minus the PctOfTtl "before" it.


Solution

  • Try subtracting out the current value:

    SELECT *
    FROM (SELECT Dim1, SUM(SQ_Fact1) AS Fact1,
                 SUM(SQ_Fact1) / SUM(SUM(SQ_Fact1)) OVER () as PctOfTtl
                 (SUM(SUM(SQ_Fact1)) OVER (ORDER BY SUM(SQ_Fact1) DESC))/SUM(SUM(SQ_Fact1)) OVER () AS Running_PctOfTtl
          FROM (SELECT Customer AS Dim1, SUM(Sales) SQ_Fact1
                FROM SalesHistory
                GROUP BY Customer
               ) sq
          GROUP BY Dim1
         ) mq
    WHERE Running_PctOfTtl - PctOfTtl < 0.10;