Search code examples
sqlsqliterangebins

Get bins range from temporary table SQL


I have a question related to my previous one.

What I have is a database that looks like:

category price  date
-------------------------    
Cat1       37   2019-03
Cat2       65   2019-03
Cat3       34   2019-03
Cat1       45   2019-03
Cat2       100  2019-03
Cat3       60   2019-03

This db has hundred of categories and comes from another one that has different attributes for each observation.

With this code:

WITH table AS 
(
    SELECT  
        category, price, date, 
        substring(date, 1, 4) AS year, 
        substring(date, 6, 2) as month
    FROM 
        original_table
    WHERE 
        (year = "2019" or year = "2020") 
        AND (month = "03") 
        AND product = "XXXXX"
    ORDER BY 
        anno
)
-- I get this from a bigger table, but prefer to make small steps 
-- that anyone in the fute can understand where this comes from as 
-- the original table is expected to grow fast
SELECT 
    category,
    ROUND(1.0 * next_price/ price - 1, 2) Pct_change,
    SUBSTR(Date, 1, 4) || '-' || SUBSTR(next_date, 1, 4) Period,
    tipo_establecimiento
FROM 
    (SELECT 
         *,
         LEAD(Price) OVER (PARTITION BY category ORDER BY year) next_price,
         LEAD(year) OVER (PARTITION BY category ORDER BY year) next_date,
         CASE 
            WHEN (category_2>= 35) AND (category_2 <= 61)
                THEN 'S'
            ELSE 'N'
         END 'tipo_establecimiento'
     FROM 
         table)
WHERE 
    next_date IS NOT NULL AND Pct_change >= 0
ORDER BY 
    Pct_change DESC

This code gets me a view of the data that looks like:

category  Pct_change  period
cat1       0.21       2019-2020
cat2       0.53       2019-2020
cat3       0.76           "

This is great! But my next view has to take this one and provide me with a range that shows how many categories are in each range.

It should look like:

range       avg      num_cat_in
[0.1- 0.4]   0.3       3

This last table is just an example of what I expect

I have been trying with a code that looks like this but i get nothing

   WITH table AS (
                SELECT  category,  price, date, substring(date, 1, 4) AS year, substring(date, 6, 2) as month
                FROM original_table
                WHERE (year= "2019" or year= "2020") and (month= "03") and product = "XXXXX"

                order by anno
)
-- I get this from a bigger table, but prefer to make small steps that anyone in the future can understand where this comes from as the original table is expected to grow fast
SELECT category,
       ROUND(1.0 * next_price/ price - 1, 2) Pct_change,
       SUBSTR(Date, 1, 4) || '-' || SUBSTR(next_date, 1, 4) Period,
       tipo_establecimiento
FROM (
  SELECT *,
         LEAD(Price) OVER (PARTITION BY category ORDER BY year) next_price,
         LEAD(year) OVER (PARTITION BY category ORDER BY year) next_date,
         CASE 
        WHEN (category_2>= 35) AND (category_2 <= 61)
            THEN 'S'
        ELSE 'N'
    END 'tipo_establecimiento'
  FROM table
)
WHERE next_date IS NOT NULL AND Pct_change>=0
ORDER BY Pct_change DESC
WHERE next_date IS NOT NULL AND Pct_change>=0

)
SELECT 

count(CASE WHEN Pct_change> 0.12 AND Pct_change <= 0.22 THEN 1 END) AS [12 - 22],
count(CASE WHEN Pct_change> 0.22 AND Pct_change <= 0.32 THEN 1 END) AS [22 - 32],
count(CASE WHEN Pct_change> 0.32 AND Pct_change <= 0.42 THEN 1 END) AS [32 - 42],
count(CASE WHEN Pct_change> 0.42 AND Pct_change <= 0.52 THEN 1 END) AS [42 - 52],
count(CASE WHEN Pct_change> 0.52 AND Pct_change <= 0.62 THEN 1 END) AS [52 - 62],
count(CASE WHEN Pct_change> 0.62 AND Pct_change <= 0.72 THEN 1 END) AS [62 - 72],
count(CASE WHEN Pct_change> 0.72 AND Pct_change <= 0.82 THEN 1 END) AS [72 - 82]

Thank you!!!


Solution

  • cf. my comment, I'm first assuming that your ranges are not hard-coded and that you wish to evenly split your data across quantiles of Prc_change. What this means is the calculation will figure out the ranges which split your sample as uniformly as possible. In this case, the following would work (where theview is the name of your previous view which calculates percentages):

    select
      concat('[',min(Pct_change),'-',min(Pct_change),']') as `range`
      , avg(Pct_change) as `avg`
      , count(*) as num_cat_in
    from(
      select *
        , ntile(5)over(order by Pct_change) as bin
      from theview
    ) t
    group by bin
    order by bin;
    

    Here is a fiddle.


    If on the other hand your ranges are hard-coded, I assume the ranges are in a table such as the one I create:

    create table theranges (lower DOUBLE, upper DOUBLE);
    insert into theranges values (0,0.2),(0.2,0.4),(0.4,0.6),(0.6,0.8),(0.8,1);
    

    (You have to make sure that the ranges are non-overlapping. By convention I include percentages in the range from the lower bound included to the upper bound excluded, except for the upper bound of 1 which is included.) It is then a matter of left-joining the tables:

    select
      concat('[',lower,'-',upper,']') as `range`
      , avg(Pct_change) as `avg`
      , sum(if(Pct_change is null, 0, 1)) as num_cat_in
    from theranges left join theview on (Pct_change>=lower and if(upper=1,true,Pct_change<upper))
    group by lower, upper
    order by lower;
    

    (Note that in the bit that says upper=1, you must change 1 to whatever your highest hard-coded range is; here I am assuming your percentages are between 0 and 1.)

    Here is the second fiddle.