Search code examples
sqlapache-spark-sqlsql-order-bypartitionrow-number

How to filter out in SQL the records in a partionned ordered table where first records of group are null?


The Data

    ROW  YEAR  PROD   KEY   DATE
    1    2011  APPLE  TIME  2011-11-18 00:00:00.000
    2    2011  APPLE  TIME  2011-11-19 00:00:00.000
    3    2013  APPLE  NULL  2011-11-18 00:00:00.000
    4    2013  APPLE  NULL  2011-11-19 00:00:00.000
    5    2013  APPLE  TIME  2014-04-08 00:00:00.000
    6    2013  APPLE  DIM   2014-04-09 00:00:00.000
    7    2013  APPLE  TIME  2014-11-10 10:50:14.113
    8    2013  APPLE  TIME  2014-11-12 10:46:04.947
    9    2013  MELON  JAK   2011-10-17 11:01:19.657
    10   2013  MELON  TIME  2014-11-18 11:19:35.547
    11   2013  MELON  NULL  2014-11-19 11:19:35.547
    12   2013  MELON  TIME  2014-11-21 10:32:36.017
    13   2014  APPLE  JAK   2003-04-10 00:00:00.000
    14   2014  APPLE  DIM   2003-04-11 00:00:00.000
    15   2015  APPLE  TIME  2002-09-27 00:00:00.000
    16   2015  APPLE  NULL  2004-09-28 00:00:00.000

ROW is not a column in the table. Is just to show which records i want.

The question

The above data is partitionned by (YEAR, PROD) and ordered by DATE.

I need to keep all the rows except row 3 and 4 based on the following logic :

  • if the first rows of a group (here (YEAR, PROD)) are NULL, discard them.
  • 11 and 16 are null but we keep them because they are not first of their group.

Each group has to start with records that have a KEY that is are not null

==> otherwise discard

In other words, i can have : not null, null, not null, null

But i cannot have : null, not null, null, not null

Expected result

    ROW  YEAR  PROD   KEY   DATE
    1    2011  APPLE  TIME  2011-11-18 00:00:00.000
    2    2011  APPLE  TIME  2011-11-19 00:00:00.000

    5    2013  APPLE  TIME  2014-04-08 00:00:00.000
    6    2013  APPLE  DIM   2014-04-09 00:00:00.000
    7    2013  APPLE  TIME  2014-11-10 10:50:14.113
    8    2013  APPLE  TIME  2014-11-12 10:46:04.947
    9    2013  MELON  JAK   2011-10-17 11:01:19.657
    10   2013  MELON  TIME  2014-11-18 11:19:35.547
    11   2013  MELON  TIME  2014-11-19 11:19:35.547
    12   2013  MELON  TIME  2014-11-21 10:32:36.017
    13   2014  APPLE  JAK   2003-04-10 00:00:00.000
    14   2014  APPLE  DIM   2003-04-11 00:00:00.000
    15   2015  APPLE  TIME  2002-09-27 00:00:00.000
    16   2015  APPLE  TIME  2004-09-28 00:00:00.000

I want to do that, so later i have always a non null key at the begginning of each group. In that way, i can later always use the former row to fill a subsequent records which have null value (in this example 11 and 16)

Any observation or suggestion would be much appreciated !


Solution

  • The following gets the output you desire. I am checking of the value of key column between rows unbounded preceeding and current row, and since NULL has the highest rank, if there are preceeding rows that are not null it would populate the field min_val with a NOT NULL column.

    select * from (
    select year,prod,key1,date1
           ,min(key1) over(partition by year,prod order by date1 asc) as min_val
      from t
       )x
    where x.min_val is not null   
    
    
    +------+-------+------+-------------------------+---------+
    | year | prod  | key1 |          date1          | min_val |
    +------+-------+------+-------------------------+---------+
    | 2011 | APPLE | TIME | 2011-11-18 00:00:00.000 | TIME    |
    | 2011 | APPLE | TIME | 2011-11-19 00:00:00.000 | TIME    |
    | 2013 | APPLE | TIME | 2014-04-08 00:00:00.000 | TIME    |
    | 2013 | APPLE | DIM  | 2014-04-09 00:00:00.000 | DIM     |
    | 2013 | APPLE | TIME | 2014-11-10 10:50:14.113 | DIM     |
    | 2013 | APPLE | TIME | 2014-11-12 10:46:04.947 | DIM     |
    | 2013 | MELON | JAK  | 2011-10-17 11:01:19.657 | JAK     |
    | 2013 | MELON | TIME | 2014-11-18 11:19:35.547 | JAK     |
    | 2013 | MELON |      | 2014-11-19 11:19:35.547 | JAK     |
    | 2013 | MELON | TIME | 2014-11-21 10:32:36.017 | JAK     |
    | 2014 | APPLE | JAK  | 2003-04-10 00:00:00.000 | JAK     |
    | 2014 | APPLE | DIM  | 2003-04-11 00:00:00.000 | DIM     |
    | 2015 | APPLE | TIME | 2002-09-27 00:00:00.000 | TIME    |
    | 2015 | APPLE |      | 2004-09-28 00:00:00.000 | TIME    |
    +------+-------+------+-------------------------+---------+
    

    link https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=ae82f64802674aa60005b8e9f534a150