Search code examples
postgresqlamazon-web-servicesamazon-redshiftgaps-in-data

Fill the table with data for missing date (postgresql, redshift)


I'm trying to fill daily data for missing dates and can not find an answer, please help.

My daily_table example:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

Expected result: I wand to fill this table with data for every domain and every day which just copy data from previous date:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

I can move a part of the logic into php, but it is undesirable, because my table has billions of missing dates.

SUMMARY:

During a few last days I foud out that:

  1. Amazon-redshift works with 8-th version of PostgreSql, that's why it does not support such a beautiful command like JOIN LATERAL
  2. Redshift also does not support generate_series and CTEs
  3. But it supports simple WITH (thank you @systemjack) but WITH RECURSIVE does not

Solution

  • Finally, I finished my task and I want to share some useful things.

    Instead of generate_series I used this hook:

    WITH date_range AS (
      SELECT trunc(current_date - (row_number() OVER ())) AS date
      FROM any_table  -- any of your table which has enough data
      LIMIT 365
    ) SELECT * FROM date_range;
    

    To get list of URLs which I have to fill with the data I used this:

    WITH url_list AS (
      SELECT
        url AS gapsed_url,
        MIN(timestamp_gmt) AS min_date,
        MAX(timestamp_gmt) AS max_date
      FROM daily_table
      WHERE url IN (
        SELECT url FROM daily_table GROUP BY url
        HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
      )
      GROUP BY url
    ) SELECT * FROM url_list;
    

    Then I combinet given data, let's call it url_mapping:

    SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
    WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;
    

    And to get data by closest date I did the following:

    SELECT sd.*
    FROM url_mapping AS um JOIN daily_table AS sd
    ON um.gapsed_url = sd.url AND (
      sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
    )
    

    I hope it will help someone.