Search code examples
sqldatabasecase

SQL: Using CASE WHEN for Missing Rows


I am working with Netezza SQL

I have the following :

CREATE TABLE sample_table 
(
    name VARCHAR(50),
    age INTEGER,
    year INTEGER,
    color VARCHAR(50),
    food VARCHAR(50),
    sport VARCHAR(50),
    source VARCHAR(50)
);
    
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('aaa', 41,  2010, 'Red', 'Pizza', 'hockey', 'original');
    
INSERT INTO sample_table (name, age, year, color, food, , source)
VALUES ('aaa', 42,  2012, 'Red', 'Pizza', 'hockey',  'original');

INSERT INTO sample_table (name, age, year, color, food,  source)
VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey',  'original');
    
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 20 2000, 'Blue', 'Burgers','football',  'original');
    
INSERT INTO sample_table (name, age,  year, color, food, source)
VALUES ('bbb', 26,  2006, 'Blue', 'Burgers', 'football',  'original');

INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football',  'original');

Question: I am interested in filling missing information (e.g. age, sport, food, color, source) only between the min and the max year for each person.

When a row is missing for a person:

  • sport, food, color - these always stay the same for the same person
  • age increases by +1 relative to the last row
  • source = "missing"

First what I did was create a table of years that contains a range between the minimum and maximum years for all years within the table:

INSERT INTO years_table (year) VALUES (2010);
INSERT INTO years_table (year) VALUES (2011);
INSERT INTO years_table (year) VALUES (2012);
INSERT INTO years_table (year) VALUES (2013);
INSERT INTO years_table (year) VALUES (2014);
INSERT INTO years_table (year) VALUES (2015);
INSERT INTO years_table (year) VALUES (2016);
INSERT INTO years_table (year) VALUES (2017);
INSERT INTO years_table (year) VALUES (2018);
INSERT INTO years_table (year) VALUES (2019);
INSERT INTO years_table (year) VALUES (2020);

Then, I tried to modify the answer provided here (SQL: Learning About Gap-And-Island Problems) :

SELECT name,
       age - t.year + years_table.year AS age,
       years_table.year,
       color,
       food,
       sport,
       CASE WHEN T.year = (SELECT MAX(year) FROM sample_table) THEN 'NEW' ELSE 'ORIGINAL' END AS source
FROM (
SELECT sample_table.*,
       COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
ORDER BY name, year;

Problem: But I am not sure if I am using the CASE WHEN statement properly. The query runs, but I can see that some newly added rows have a value of SOURCE = ORIGINAL when they should have SOURCE = NEW.

Can someone please show me how to fix this?

Thanks!


Solution

  • There are issues in your code (source is not declared on your table CREATE command, the INSERT will not work because of it and because you are missing the sport column in them). From your attempt at modifying my previous query, I will assume the source column does not exist.

    The important bits in the query are COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) and years_table.year BETWEEN t.year AND t.lastyeartogenerate with t.year being the original sample_table.year. That is basically where the term missing information is defined in the code.
    With that it mind, it is easy to deduce:

    • SOURCE = ORIGINAL when years_table.year = t.year
      Reminder: as per the above, this simply means years_table.year = sample_table.year.
    • Oppositely, SOURCE = NEW when years_table.year > t.year.
      Reminder: Thanks to the LEAD(...) - 1 and the JOIN, all the values of year_table.year between t.year + 1 (= sample_table.year + 1) and t.lastyeartogenerate (both bound included) are absent from sample_table.

    Therefore, your CASE WHEN should be:

    CASE WHEN years_table.year = t.year THEN 'ORIGINAL' ELSE 'NEW' END
    

    For the alternative and IMO better version of the query I had provided (the one that uses _v_vector_idx):

    CASE WHEN idx = 0 THEN 'ORIGINAL' ELSE 'NEW' END
    

    Alternative solution: you can scrap the CASE WHEN entirely and reuse the query in a UNION ALL. Note that the JOIN must be changed in the code query to exclude the original records (they come from before the union):

    SELECT *, 'ORIGINAL' AS Source
    FROM sample_table
    UNION ALL
    SELECT name,
           age - t.year + years_table.year,
           years_table.year,
           color,
           food,
           sport,
           'NEW'
    FROM (  
    SELECT sample_table.*,
           COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
    FROM sample_table
    ) T
    JOIN years_table ON years_table.year BETWEEN T.year + 1 AND t.lastyeartogenerate
    ORDER BY name, year
    

    For the other query, it means you have to change the JOIN to

    ON idx BETWEEN 1 AND indicesToGenerate