Search code examples
db2

Replacing Null Values in a Table with Values from other Table


Here are two tables:

table_a:

name year  var
---------------
john 2010    a
john 2011    a
john 2012    c
alex 2020    b
alex 2021    c
tim  2015 NULL
tim  2016 NULL
joe  2010 NULL
joe  2011 NULL
jessica 2000 NULL
jessica 2001 NULL

table_b

    name year var
    --------------
    sara 2001   a
    sara 2002   b
     tim 2005   c
     tim 2006   d
     tim 2021   f
 jessica 2020   z

Here is what I am trying to accomplish:

  • Take names that have NULL values in table_a
  • See if these names appear in table_b
  • If yes, then see if the name (table_a) has a row in table_b with a year (table_b) that occurs BEFORE the year in table_a
  • If yes, replace the NULL in table_a with the value of var (table_b) that is closest to the earliest year (table_a)
  • Otherwise, keep as is

I tried this:

 WITH min_year AS (
    SELECT name, MIN(year) as min_year
    FROM table_a
    GROUP BY name
  ),
  b_filtered AS (
    SELECT b.name, MAX(b.year) as year, b.var
    FROM table_b b
    INNER JOIN min_year m ON b.name = m.name AND b.year < m.min_year
    GROUP BY b.name
  )
  SELECT a.name, a.year, 
    CASE 
      WHEN a.var IS NULL AND b.name IS NOT NULL THEN b.var
      ELSE a.var
    END as var_mod
  FROM table_a a
  LEFT JOIN b_filtered b
  ON a.name = b.name;

But I got the wrong output:

 name year var_mod
    john 2010       a
    john 2011       a
    john 2012       c
    alex 2020       b
    alex 2021       c
     tim 2015    NULL
     tim 2016    NULL
     joe 2010    NULL
     joe 2011    NULL
 jessica 2000    NULL
 jessica 2001    NULL

Correct output should be:

name year var_mod
    john 2010       a
    john 2011       a
    john 2012       c
    alex 2020       b
    alex 2021       c
     tim 2015       d
     tim 2016       d
     joe 2010    NULL
     joe 2011    NULL
 jessica 2000    NULL
 jessica 2001    NULL

How can I correct it?


Solution

  • Use the handy TABLE clause, which allows you to use a "parameterized" sub-select as below.

    WITH 
      table_a (name, year, var) AS 
      (
        VALUES
          ('john',    2010, 'a' )
        , ('john',    2011, 'a' )
        , ('john',    2012, 'c' )
        , ('alex',    2020, 'b' )
        , ('alex',    2021, 'c' )
        , ('tim',     2015, NULL)
        , ('tim',     2016, NULL)
        , ('joe',     2010, NULL)
        , ('joe',     2011, NULL)
        , ('jessica', 2000, NULL)
        , ('jessica', 2001, NULL)
      )
    , table_b (name, year, var) AS
      (
        VALUES
          ('sara',    2001, 'a')
        , ('sara',    2002, 'b')
        , ('tim',     2005, 'c')
        , ('tim',     2006, 'd')
        , ('tim',     2021, 'f')
        , ('jessica', 2020, 'z')
      )
    SELECT a.NAME, a.YEAR, COALESCE (a.var, b.var) AS VAR_MOD
    FROM table_a a
    LEFT JOIN TABLE 
    (
      SELECT b.var
      FROM table_b b
      WHERE b.name = a.name and b.year < a.year
      ORDER BY b.year DESC
      FETCH FIRST 1 ROW ONLY
    ) b ON 1=1
    

    The result is:

    NAME YEAR VAR_MOD
    john 2010 a
    john 2011 a
    john 2012 c
    alex 2020 b
    alex 2021 c
    tim 2015 d
    tim 2016 d
    joe 2010
    joe 2011
    jessica 2000
    jessica 2001

    Update:

    Another much less efficient way with a number of additional the same table references:

    WITH b_filtered AS 
     (
       SELECT a.NAME, MAX (b.YEAR) AS YEAR
       FROM table_a a
       JOIN table_b b ON b.NAME = a.NAME AND b.YEAR < a.YEAR
       GROUP BY a.NAME
     )  
    SELECT a.NAME, a.YEAR, COALESCE (a.var, b.var) AS VAR_MOD
    FROM table_a a
    LEFT JOIN 
    (
      b_filtered f 
      JOIN table_b b ON b.NAME = f.NAME AND b.YEAR = f.YEAR
    ) ON f.NAME = a.NAME