I am working with Netezza SQL
I have the following :
CREATE TABLE sample_table
(
name VARCHAR(50),
age INTEGER,
year INTEGER,
color VARCHAR(50),
food VARCHAR(50),
sport VARCHAR(50),
source VARCHAR(50)
);
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('aaa', 41, 2010, 'Red', 'Pizza', 'hockey', 'original');
INSERT INTO sample_table (name, age, year, color, food, , source)
VALUES ('aaa', 42, 2012, 'Red', 'Pizza', 'hockey', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 20 2000, 'Blue', 'Burgers','football', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 26, 2006, 'Blue', 'Burgers', 'football', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football', 'original');
Question: I am interested in filling missing information (e.g. age, sport, food, color, source) only between the min and the max year for each person.
When a row is missing for a person:
First what I did was create a table of years that contains a range between the minimum and maximum years for all years within the table:
INSERT INTO years_table (year) VALUES (2010);
INSERT INTO years_table (year) VALUES (2011);
INSERT INTO years_table (year) VALUES (2012);
INSERT INTO years_table (year) VALUES (2013);
INSERT INTO years_table (year) VALUES (2014);
INSERT INTO years_table (year) VALUES (2015);
INSERT INTO years_table (year) VALUES (2016);
INSERT INTO years_table (year) VALUES (2017);
INSERT INTO years_table (year) VALUES (2018);
INSERT INTO years_table (year) VALUES (2019);
INSERT INTO years_table (year) VALUES (2020);
Then, I tried to modify the answer provided here (SQL: Learning About Gap-And-Island Problems) :
SELECT name,
age - t.year + years_table.year AS age,
years_table.year,
color,
food,
sport,
CASE WHEN T.year = (SELECT MAX(year) FROM sample_table) THEN 'NEW' ELSE 'ORIGINAL' END AS source
FROM (
SELECT sample_table.*,
COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
ORDER BY name, year;
Problem: But I am not sure if I am using the CASE WHEN statement properly. The query runs, but I can see that some newly added rows have a value of SOURCE = ORIGINAL when they should have SOURCE = NEW.
Can someone please show me how to fix this?
Thanks!
There are issues in your code (source
is not declared on your table CREATE
command, the INSERT
will not work because of it and because you are missing the sport
column in them). From your attempt at modifying my previous query, I will assume the source
column does not exist.
The important bits in the query are COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year)
and years_table.year BETWEEN t.year AND t.lastyeartogenerate
with t.year
being the original sample_table.year
. That is basically where the term missing information is defined in the code.
With that it mind, it is easy to deduce:
SOURCE = ORIGINAL
when years_table.year = t.year
years_table.year = sample_table.year
.SOURCE = NEW
when years_table.year > t.year
.LEAD(...) - 1
and the JOIN
, all the values of year_table.year
between t.year + 1
(= sample_table.year + 1
) and t.lastyeartogenerate
(both bound included) are absent from sample_table
.Therefore, your CASE WHEN
should be:
CASE WHEN years_table.year = t.year THEN 'ORIGINAL' ELSE 'NEW' END
For the alternative and IMO better version of the query I had provided (the one that uses _v_vector_idx
):
CASE WHEN idx = 0 THEN 'ORIGINAL' ELSE 'NEW' END
Alternative solution: you can scrap the CASE WHEN
entirely and reuse the query in a UNION ALL
. Note that the JOIN
must be changed in the code query to exclude the original records (they come from before the union):
SELECT *, 'ORIGINAL' AS Source
FROM sample_table
UNION ALL
SELECT name,
age - t.year + years_table.year,
years_table.year,
color,
food,
sport,
'NEW'
FROM (
SELECT sample_table.*,
COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year + 1 AND t.lastyeartogenerate
ORDER BY name, year
For the other query, it means you have to change the JOIN
to
ON idx BETWEEN 1 AND indicesToGenerate