Search code examples
sqlnetezza

SQL: Learning About Gap-And-Island Problems


I am working with Netezza SQL.

In a previous question (Replacing CTE's with Individual Queries) I learned about the basics of "Gap and Island Problems" in which the goal is to "fill" missing records for each name.

Suppose there is a table with the names of different people over different years (some of the years are missing). Lets assume that each person has favorite color, favorite food and favorite sport - and this information DOES NOT change over the years. However, the age of each person DOES change each year.

   CREATE TABLE sample_table 
    (
        name VARCHAR(50),
        age INTEGER,
        year INTEGER,
        color VARCHAR(50),
        food VARCHAR(50),
        sport VARCHAR(50)
    );
    
    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('aaa', 41,  2010, 'Red', 'Pizza', 'hockey');
    
    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('aaa', 42,  2012, 'Red', 'Pizza', 'hockey');

    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey');
    
    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('bbb', 20 2000, 'Blue', 'Burgers','football');
    
    INSERT INTO sample_table (name, age,  year, color, food)
    VALUES ('bbb', 26,  2006, 'Blue', 'Burgers', 'football');


    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football');



+------+-----+------+-------+---------+----------+
| name | age | year | color |  food   |  sport   |
+------+-----+------+-------+---------+----------+
| aaa  |  41 | 2010 | Red   | Pizza   | hockey   |
| aaa  |  42 | 2012 | Red   | Pizza   | hockey   |
| aaa  |  47 | 2017 | Red   | Pizza   | hockey   |
| bbb  |  20 | 2000 | Blue  | Burgers | football |
| bbb  |  26 | 2006 | Blue  | Burgers | football |
| bbb  |  30 | 2010 | Blue  | Burgers | football |
+------+-----+------+-------+---------+----------+

In this problem, I am interested in filling missing information (e.g. age, sport, food, color) only between the min and the max year for each person. Specifically, I would like to learn how to do this problem without CTEs and through a "standard query".

Here is what I have attempted so far:

# https://stackoverflow.com/questions/75677585/replacing-ctes-with-individual-queries
    create table years_table (year integer);
    
    insert into years_table(year)
    values(2010);
    insert into years_table(year)
    values(2011);
    insert into years_table(year)
    values(2012);
    insert into years_table(year)
    values(2013);
    insert into years_table(year)
    values(2014);
    insert into years_table(year)
    values(2015);
    insert into years_table(year)
    values(2016);
    insert into years_table(year)
    values(2017);
    insert into years_table(year)
    values(2018);
    insert into years_table(year)
    values(2019);
    insert into years_table(year)
    values(2020);
    
    
    select name, year, max(color) over(partition by name, grp order by year) color, max(sport) over(partition by name, grp order by year) sport, max(food) over(partition by name, grp order by year) food
    from ( 
        select n.name, y.year, t.color, t.food, t.sport
            sum(case when t.name is null then 0 else 1 end) over(partition by n.name order by y.year) grp
        from (
            select name, min(year) min_year, max(year) max_year
            from sample_table
            group by name
        ) n
        inner join years_table y on y.year between n.min_year and n.max_year
        left join sample_table t on t.name = n.name and t.year = y.year
    ) t

But I am not sure how to adapt this SQL code to make the "age" information for each person change over the years.

Can someone please show me how to do this? Preferably, I would like to learn how to do this without Recursive CTE's as they are not supported in Netezza.

Thanks!

Note: The final result should look something like this:

+------+-----+------+-------+---------+----------+
| name | age | year | color |  food   |  sport   |
+------+-----+------+-------+---------+----------+
| aaa  |  41 | 2010 | Red   | Pizza   | hockey   |
| aaa  |  42 | 2011 | Red   | Pizza   | hockey   |
| aaa  |  42 | 2012 | Red   | Pizza   | hockey   |
| aaa  |  43 | 2013 | Red   | Pizza   | hockey   |
| aaa  |  44 | 2014 | Red   | Pizza   | hockey   |
| aaa  |  45 | 2015 | Red   | Pizza   | hockey   |
| aaa  |  46 | 2016 | Red   | Pizza   | hockey   |
| aaa  |  47 | 2017 | Red   | Pizza   | hockey   |
| bbb  |  20 | 2000 | Blue  | Burgers | football |
| bbb  |  21 | 2001 | Blue  | Burgers | football |
| bbb  |  22 | 2002 | Blue  | Burgers | football |
| bbb  |  23 | 2003 | Blue  | Burgers | football |
| bbb  |  24 | 2004 | Blue  | Burgers | football |
| bbb  |  25 | 2005 | Blue  | Burgers | football |
| bbb  |  26 | 2006 | Blue  | Burgers | football |
| bbb  |  27 | 2007 | Blue  | Burgers | football |
| bbb  |  28 | 2008 | Blue  | Burgers | football |
| bbb  |  29 | 2009 | Blue  | Burgers | football |
| bbb  |  30 | 2010 | Blue  | Burgers | football |
+------+-----+------+-------+---------+----------+

Solution

  • I do not know about Netezza so much but if it supports window functions, you can use a LEAD to get the year of the next record in your table.

    SELECT sample_table.*,
           COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
    FROM sample_table
    

    Personally, I would stop the work here and let whatever application is in charge of generating the final table take over from there with a nested loop. In pseudo-code:

    for each record in recordset, do {
        for y from year to lastyeartogenerate do {
            [...]
        }
    }
    

    That limits the query complexity and the amount of redundant data to be sent over a network.

    If you still wish to get the expected table you mention right out of the database, simply join with years_table , although it needs to start on year 2000.

    SELECT name, 
           age - t.year + years_table.year AS age,
           years_table.year,
           color,
           food,
           sport
    FROM (
    SELECT sample_table.*,
           COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
    FROM sample_table
    ) T
    JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
    ORDER BY name, year
    

    Better yet, replace your years_table with something in the line of what is presented here (I must admit it is slightly above what I know of that DBMS and I have no access to one to test):

    SELECT name, 
           age + idx AS age,
           year + idx AS year,
           color,
           food,
           sport
    FROM (
    SELECT sample_table.*,
           COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) - year AS indicestogenerate
    FROM sample_table
    ) T
    JOIN _v_vector_idx ON idx <= indicesToGenerate
    ORDER BY name, year + idx
    

    I expect this to work but then again, have no access to a DB to test it.