Search code examples
sqlcountamazon-redshiftgaps-and-islandsdate-arithmetic

How to perform a running count on null values that resets on non-null values


I am trying to calculate a column that performs a running count of consecutive null values, but the running count will reset upon non-null values.

I am currently trying to achieve this on this version of redshift:

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.8187

I tried using this window function but this is just continuously incrementing the number for each null.

ROW_NUMBER() OVER (PARTITION BY ID, VAL ORDER BY VAL ROWS UNBOUNDED PRECEDING)

For example if I had a dataset like this:

id  | date  | val
----+-------+-------
  1 |   1/1 | NULL
  1 |   1/2 | NULL 
  1 |   1/3 | NULL 
  1 |   1/4 |  1
  1 |   1/5 | NULL 
  1 |   1/6 | NULL 
  1 |   1/7 |  1 
  2 |   1/8 |  2
  2 |   1/9 | NULL
  2 |   1/1 | NULL
  2 |   1/2 |  1
  2 |   1/3 | NULL
  2 |   1/4 |  0
  2 |   1/5 | NULL
  2 |   1/6 | NULL  

I would like the output to look like this:

id  | date  | val   | foo
----+-------+-------+-------
  1 |   1/1 | NULL  |  1
  1 |   1/2 | NULL  |  2
  1 |   1/3 | NULL  |  3
  1 |   1/4 |  1    |
  1 |   1/5 | NULL  |  1 
  1 |   1/6 | NULL  |  2
  1 |   1/7 |  1    |
  2 |   1/8 |  2    |
  2 |   1/9 | NULL  |  1
  2 |   1/1 | NULL  |  2
  2 |   1/2 |  1    |
  2 |   1/3 | NULL  |  1
  2 |   1/4 |  0    |
  2 |   1/5 | NULL  |  1
  2 |   1/6 | NULL  |  2

Solution


  • Starter

    I think that there is a glitch in your sample data, in the below highlighted records:

    id  | date  | val   | foo
    ----+-------+-------+-------
      1 |   1/1 | NULL  |  1
      1 |   1/2 | NULL  |  2
      1 |   1/3 | NULL  |  3
      1 |   1/4 |  1    |
      1 |   1/5 | NULL  |  1 
      1 |   1/6 | NULL  |  2
      1 |   1/7 |  1    |
      2 |   1/8 |  2    |       --> this record is not in sequence
      2 |   1/9 | NULL  |  1    --> neither this one
      2 |   1/1 | NULL  |  2    --> so this record should have foo = 1, not 2
      2 |   1/2 |  1    |
      2 |   1/3 | NULL  |  1
      2 |   1/4 |  0    |
      2 |   1/5 | NULL  |  1
      2 |   1/6 | NULL  |  2
    

    I simply expurged these three records from the dataset. If you are not ok with that, do not read further...


    Answer

    This is a variation of the gaps-and-island problem. To solve it, the idea is to build groups made of consecutive null records. For this, we compute row_number()s over two different partitions (by id vs by id and null/not null val). The difference between the row numbers defines the group.

    Then, all that is left to do is assign new row numbers to each record having a null val within the group it belongs to.

    Query:

    select 
        id,
        date,
        val,
        case when val is null
            then row_number() over(partition by id, rn1 - rn2 order by date) 
            else null
        end foo
    from (
        select
            t.*,
            row_number() 
                over(order by id, date) rn1,
            row_number() 
                over(partition by id, case when val is null then 1 else 0 end order by date ) rn2
        from mytable t
    ) t
    order by id, date   
    

    Demo on DB Fiddle:

    | id  | date                     | val | foo |
    | --- | ------------------------ | --- | --- |
    | 1   | 2019-01-01T00:00:00.000Z |     | 1   |
    | 1   | 2019-01-02T00:00:00.000Z |     | 2   |
    | 1   | 2019-01-03T00:00:00.000Z |     | 3   |
    | 1   | 2019-01-04T00:00:00.000Z | 1   |     |
    | 1   | 2019-01-05T00:00:00.000Z |     | 1   |
    | 1   | 2019-01-06T00:00:00.000Z |     | 2   |
    | 1   | 2019-01-07T00:00:00.000Z | 1   |     |
    | 2   | 2019-01-02T00:00:00.000Z | 1   |     |
    | 2   | 2019-01-03T00:00:00.000Z |     | 1   |
    | 2   | 2019-01-04T00:00:00.000Z | 0   |     |
    | 2   | 2019-01-05T00:00:00.000Z |     | 1   |
    | 2   | 2019-01-06T00:00:00.000Z |     | 2   |