Search code examples
sqlpostgresqlself-reference

Self-referential CASE WHEN clause in SQL


I'm trying to migrate some poorly formed data into a database. The data comes from a CSV, and is first loaded into a staging table of all varchar columns (as I cannot enforce type safety at this stage).

The data might look like

COL1     | COL2 | COL3
Name 1   |      |     
2/11/16  | $350 | $230
2/12/16  | $420 | $387
2/13/16  | $435 | $727
Name 2   |      |     
2/11/16  | $121 | $144
2/12/16  | $243 | $658
2/13/16  | $453 | $214

The first colum is a mixture of company names as pseudo-headers, and dates for which colum 2 and 3 data is relevant. I'd like to start transforming the data by creating a 'Brand' column - where 'StoreBrand' is the value of Col1 if Col2 is NULL, or the previous row's StoreBrand otherwise. Comething like:

COL1     | COL2 | COL3 | StoreBrand
Name 1   |      |      | Name 1
2/11/16  | $350 | $230 | Name 1
2/12/16  | $420 | $387 | Name 1
2/13/16  | $435 | $727 | Name 1
Name 2   |      |      | Name 2
2/11/16  | $121 | $144 | Name 2
2/12/16  | $243 | $658 | Name 2
2/13/16  | $453 | $214 | Name 2

I wrote this:

SELECT 
    t.*,
    CASE
        WHEN t.COL2 IS NULL THEN COL1
        ELSE                     LAG(StoreBrand) OVER ()
    END AS StoreBrand
FROM
(
    SELECT
        ROW_NUMBER() OVER () AS i,
        *
    FROM
        Staging_Data
) t;

But the database (postgres in this case, but we're considering alternatives so the most diverse answer is preferred) chokes on LAG(StoreBrand) because that's the derived column I'm creating. Invoking LAG(Col1) only populates the first row's real data:

COL1     | COL2 | COL3 | StoreBrand
Name 1   |      |      | Name 1
2/11/16  | $350 | $230 | Name 1
2/12/16  | $420 | $387 | 2/11/16
2/13/16  | $435 | $727 | 2/12/16
Name 2   |      |      | Name 2
2/11/16  | $121 | $144 | Name 2
2/12/16  | $243 | $658 | 2/11/16
2/13/16  | $453 | $214 | 2/12/16

My goal would be a StoreBrand column which is the first value of COL1 for all date values before the next brand name:

COL1     | COL2 | COL3 | StoreBrand
Name 1   |      |      | Name 1
2/11/16  | $350 | $230 | Name 1
2/12/16  | $420 | $387 | Name 1
2/13/16  | $435 | $727 | Name 1
Name 2   |      |      | Name 2
2/11/16  | $121 | $144 | Name 2
2/12/16  | $243 | $658 | Name 2
2/13/16  | $453 | $214 | Name 2

The value of StoreBrand when Col2 and Col3 are null is inconsequential - that row will be dropped as part of the conversion process. The important thing is associating the data rows (i.e. those with dates) with their brand.

Is there a way to reference the previous value for the column that I'm missing?


Solution

  • Edit for people who find this question through a search engine:

    The trick was to use WITH which allows to use a temporary result at several places (link).


    I think this does what you want and discards the null rows at the same time (if you want to). We basically select all the brands before the row we are currently looking at and if no "brand row" exists between it and the current row then we take it.

    WITH t AS
       (SELECT
          ROW_NUMBER() OVER () AS i,
          *
       FROM
          Staging_Data
       )
    SELECT
       a.COL1,
       a.COL2,
       a.COL3,
       (SELECT b.COL1 FROM t b WHERE b.COL2 IS NULL AND b.i <= a.i AND NOT EXISTS(
          SELECT * FROM t c WHERE c.COL2 IS NULL AND c.i <= a.i AND c.i > b.i)
       ) StoreBrand
    FROM
       t a
    WHERE -- I don't think you need those rows? Otherwise remove it.
       a.COL2 IS NOT NULL
    

    It can be a bit confusing. t is a temporary table we defined with your query. And a, b and c are aliases for t. We could also write FROM t AS a to make it more obvious.