Search code examples
sqlmysqljoinresamplingcross-join

How to resample an SQL database


I have a dataset which looks like below:

ITEM    CITY        START_Y   START_W   FIRST_USE_Y   FIRST_USE_W   VALUE
A       NEW YORK    2023      30             2023             32    15000
A       LONDON      2024       2             2024              2    12000
A       LONDON      2024       2             2024              5    50000
B       NEW YORK    2023      49             2024              1    19540
B       MADRID      2023      10             2023             11    15444

First, the combination of ITEM and CITY need to be grouped. Then for each group, I want to resample weekly up to 5 datapoints and fill the 'VALUE' column with zero where there is no value for the combination of FIRST_USE_Y and FIRST_USE_W columns. START_W and FIRST_USE_W are the number of week of the year(value can be from 1 to 52).

I tried with pandas and a for loop; it worked. But as it is a very large dataset with many millions of rows and I am bound to do with SQL(where I am a newbie). This is the code I tried:

WITH RECURSIVE weekly_intervals AS (
    SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
    FROM citywise_values
    UNION ALL
    SELECT start_w + INTERVAL 1 WEEK, end_w
    FROM weekly_intervals
    WHERE start_w + INTERVAL 1 WEEK <= end_w
),
filled_values AS (
    SELECT 
        w.item,
        w.city,
        w.start_y,
        w.start_w,
        COALESCE(cv.value, 0) AS value
    FROM 
        (SELECT 
            item,
            city,
            start_y,
            start_w
        FROM 
            citywise_values
        GROUP BY 
            item, city) w
    LEFT JOIN 
        citywise_values cv ON w.item = cv.item
                             AND w.city = cv.city
                             AND w.start_y = cv.start_y
                             AND w.start_w = cv.start_w
)
SELECT 
    item,
    city,
    start_y,
    start_w,
    COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM 
    filled_values
RIGHT JOIN
    weekly_intervals
ON
    filled_values.start_w = weekly_intervals.start_w
ORDER BY
    item, city, start_y, start_w

Then I tried with a cross join and was able to produce the result only for one single combination of the ITEM and CITY. But I could not find how to do for the whole dataset.

I am not sure that I could explain it well or not. So, I am posting the desired output which I created manually.

ITEM    CITY        START_Y     START_W     FIRST_USE_Y     FIRST_USE_W     VALUE
A       NEW YORK    2023        30                2023              30      0
A       NEW YORK    2023        30                2023              31      0
A       NEW YORK    2023        30                2023              32      15000
A       NEW YORK    2023        30                2023              33      0
A       NEW YORK    2023        30                2023              34      0
A       LONDON      2024        2                 2024              2       12000
A       LONDON      2024        2                 2024              3       0
A       LONDON      2024        2                 2024              4       0
A       LONDON      2024        2                 2024              5       50000
A       LONDON      2024        2                 2024              6       0
B       NEW YORK    2023        49                2023              49      0
B       NEW YORK    2023        49                2023              50      0
B       NEW YORK    2023        49                2023              51      0
B       NEW YORK    2023        49                2023              52      0
B       NEW YORK    2023        49                2024              1       19540
B       MADRID      2023        10                2023              10      0
B       MADRID      2023        10                2023              11      15444
B       MADRID      2023        10                2023              12      0
B       MADRID      2023        10                2023              13      0
B       MADRID      2023        10                2023              14      0

Any help will be appreciated.


Solution

  • There are three items here that make this query tricky:

    1. Projecting out to 5 rows (I used a table-value constructor, but there are other options, including generate_series() in SQL Server 2022 or a recursive CTE)
    2. Handling when the year wraps around. The trick here is dates are messy. Avoid doing date math work yourself. Instead, lean as much as possible on the date features built into the platform. In this case, it means converting the Year/Week values into an actual date... which is one reason I recommend storing things this way in the first place. You could store these values a dates instead, where the actual day is the Sunday value for that week.
    3. In spite of #2, I still had to make one concession to manual date math: since the year doesn't always start on the same day and the week number in the data don't seem to always match either the "regular" or iso_week returned by SQL Server, possibly due to an error with the sample data, I treated week numbers as 7-day blocks since January 1 manually.

    I also see this:

    START_W and FIRST_USE_W are the number of week of the year(value can be from 1 to 52).

    There are more than 52 weeks in a year!

    Every year will have a partial week number 53 with at least one or two days. You need to be able to account for this.

    I came up with this, which even used a CROSS JOIN:

    WITH ItemCity As (
        SELECT Item, City, MIN(  DATEADD(day, Start_W*7, DATEFROMPARTS(Start_Y, 1, 1)) ) As StartWeek
        FROM Data
        GROUP BY Item, City
    ), 
    ItemCityWeeks As (
       SELECT Item,City, StartWeek
           ,Year(StartWeek) As Start_Y,datepart(week, StartWeek)-1 As Start_W
           ,YEAR(DATEADD(day, Weeks.num*7, StartWeek)) As First_Use_Y
           ,DATEPART(dayofyear, DATEADD(day, Weeks.num*7, StartWeek))/7 As First_Use_W
       FROM ItemCity
       CROSS JOIN ( VALUES (0), (1), (2), (3), (4)) Weeks(num)
    )
    SELECT icw.Item, icw.City
          , icw.Start_Y, icw.Start_W, icw.First_Use_Y, icw.First_Use_W
          , coalesce(d.value, 0) as Value
    FROM ItemCityWeeks icw
    LEFT JOIN Data d ON d.Item = icw.Item AND d.City = icw.City 
          and d.First_Use_Y = icw.First_Use_Y and d.First_Use_W = icw.First_Use_W
    ORDER BY Item, City DESC
    

    See it work here:

    https://dbfiddle.uk/1PyTujMX

    Also note my First Use weeks are off by the one for the first city. I believe this is an error in the manually-created sample results, as the other cities all start with the same week as the Start_W, and this one city starts one week later.