Search code examples
apache-sparkpysparkapache-spark-sqlcase

How to divide the number field infinitely by fixed step in Spark SQL?


I have a table named 'numExample' with only one field like this:

| number |
| 6 |
| 7 |
| 9 |
| 12 |
| 12 |
| 18 |
| 13 |
| 23 |
| 16 |
| 20 |
| 21 |
| 50 |
| 6 |
| ... |
| 400 |
| ... |

The ... means there may be multiple numbers and the upper limit is uncertain. Now I have to calculate how many times the numbers in each interval appear, the interval is left closed and right open, and the step size is fixed to 5, eg: [0, 5)、[5, 10). I want to use CASE clause but the upper limit is uncertain, so I can't list all WHEN clause.

SELECT numInterval, count(DISTINCT number) AS count
FROM (
    SELECT
        number,
        CASE
            WHEN number >=0 AND number < 5 THEN '[0, 5)'
            WHEN number >=5 AND number < 10 THEN '[5, 10)'
            WHEN number >=10 AND number < 15 THEN '[10, 15)'
            ...
            ...
        END AS numInterval
    FROM numExample
)
GROUP BY numInterval;

Both Spark SQL code and PySpark code are acceptable. I can't thank you enough for any help.


Solution

    1. Calculate the left of interval: floor(number/5)*5
    2. Group: group by intervalLeft, so we can get the result
    3. Format the group name: concat( "[", intervalLeft, ",", intervalLeft + 5, ")" )

    Here is the example code:

    SELECT
        intervalLeft,
        concat( "[", intervalLeft, ",", intervalLeft + 5, ")" ) AS numInterval,
        count( DISTINCT number ) AS count 
    FROM
        ( SELECT number, floor( number / 5 )* 5 AS intervalLeft FROM numExample ) 
    GROUP BY
        intervalLeft;
    

    Hope this helps.