Search code examples
sqlsubquerysnowflake-cloud-data-platformorder-of-execution

identifier not being recognized in a 'with' statement using a subquery in Snowflake


so I have a table that's sort of like this:

DELIVERY_AREA_ID,DELIVERY_RADIUS_METERS,EVENT_STARTED_TIMESTAMP
234sfd,4000,2020-01-01 12:19:29.719
234sfd,6500,2020-01-01 12:31:40.325
234sfd,3500,2020-01-01 12:53:10.538
234sfd,6500,2020-01-01 13:11:36.094
234sfd,3500,2020-01-01 13:32:26.754
234sfd,6500,2020-01-01 13:59:11.104
234sfd,6500,2020-01-02 07:44:16.792
234sfd,3500,2020-01-02 08:07:36.284
234sfd,6500,2020-01-02 08:54:08.014
234sfd,3500,2020-01-02 09:53:05.853
234sfd,6500,2020-01-02 10:04:39.443
234sfd,10000,2020-07-01 08:29:20.194
234sfd,3500,2020-07-03 07:50:41.782
234sfd,10000,2020-07-03 08:33:14.695
234sfd,3500,2020-07-05 07:47:05.539
234sfd,10000,2020-07-05 07:53:13.930
234sfd,3500,2020-07-05 09:18:57.688
234sfd,10000,2020-07-05 09:51:07.547
234sfd,3500,2020-07-19 18:02:14.099

the data is actually much more varied but yeah it follows that format.

I am trying to, in one query, in snowflake database, make a 'default delivery radius' which is just the radius by month/year that has the longest duration, and then calculate the total durations in that month that had less than this default delivery radius. currently by joining. I don't want to create new tables, I know that would be easier though.

Here is my current attempt:

-- Find the default delivery radius for each delivery area
WITH default_radiuses AS (
SELECT DELIVERY_AREA_ID,
       MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
       MONTH_YEAR,
       DELIVERY_RADIUS_METERS,
       SUM(DURATION_SECONDS) AS total_duration,
       max(EVENT_STARTED_TIMESTAMP) as MAX_TIMESTAMP,
       RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR ORDER BY total_duration DESC) AS RADIUS_RANK
FROM (
    -- Add the MONTH_YEAR column to the delivery_radius_log table
    SELECT DELIVERY_AREA_ID,
           DELIVERY_RADIUS_METERS,
           EVENT_STARTED_TIMESTAMP,
           CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/', YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
           DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
           DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
    FROM delivery_radius_log
)
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
ORDER BY RADIUS_RANK asc, month(MAX_TIMESTAMP)
)

-- Find the duration of each radius reduction
SELECT DELIVERY_AREA_ID,
       CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/', YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
       DELIVERY_RADIUS_METERS,
       EVENT_STARTED_TIMESTAMP,
       DATEADD(hour, DATEDIFF(hour, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
       DATEDIFF(hour, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_hours
FROM delivery_radius_log
JOIN default_radiuses USING (DELIVERY_AREA_ID, MONTH_YEAR)
WHERE DELIVERY_RADIUS_METERS != default_delivery_radius

I currently get the error 'Invalid identifier MONTH_YEAR'. what can I do to resolve this?

The two queries work separately, so I guess I am just missing an order of execution thing? in my mind it works out but I'm not sure obviously

ok after some feedbacks I have added an alias, dropped the order by, but I still think I have some issues with my join:

-- Find the default delivery radius for each delivery area
WITH default_radiuses AS (
    SELECT DELIVERY_AREA_ID,
           MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
           MONTH_YEAR,
           DELIVERY_RADIUS_METERS,
           SUM(DURATION_SECONDS) AS total_duration,
           MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
           RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
                        ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
    FROM (
        -- Add the MONTH_YEAR column to the delivery_radius_log table
        SELECT DELIVERY_AREA_ID,
               DELIVERY_RADIUS_METERS,
               EVENT_STARTED_TIMESTAMP,
               CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
                      YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
               DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
               DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
        FROM delivery_radius_log
    ) t  -- added alias here
    GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
)


-- Find the duration of each radius reduction
SELECT a.DELIVERY_AREA_ID,
       CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/', YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
       a.DELIVERY_RADIUS_METERS,
       EVENT_STARTED_TIMESTAMP,
       DATEADD(hour, DATEDIFF(hour, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY a.DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
       DATEDIFF(hour, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY a.DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_hours
FROM delivery_radius_log a
JOIN default_radiuses b on (a.DELIVERY_AREA_ID = b.DELIVERY_AREA_ID,CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
                      YEAR(EVENT_STARTED_TIMESTAMP)) = b.MONTH_YEAR)
WHERE a.DELIVERY_RADIUS_METERS != default_delivery_radius

I get the error

Invalid data type [ROW(BOOLEAN, BOOLEAN)] for predicate [ROW(A.DELIVERY_AREA_ID = B.DELIVERY_AREA_ID, (CONCAT(CAST(EXTRACT(month from A.EVENT_STARTED_TIMESTAMP) AS VARCHAR(16777216)), '/', CAST(EXTRACT(year from A.EVENT_STARTED_TIMESTAMP) AS VARCHAR(16777216)))) = B.MONTH_YEAR)]

I don't get how there is an invalid join if I use the same method to create that field in the earlier with statement


Solution

  • Every derived table/subquery requires an alias in SQL. In addition, there is no point in using an ORDER BY clause in your CTE, so it should be removed. Also, you cannot refer to the total_duration alias in the select in the same select in which it was defined. Instead, just repeat the SUM() expression. Making both of these changes we can try:

    WITH default_radiuses AS (
        SELECT DELIVERY_AREA_ID,
               MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
               MONTH_YEAR,
               DELIVERY_RADIUS_METERS,
               SUM(DURATION_SECONDS) AS total_duration,
               MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
               RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
                            ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
        FROM (
            -- Add the MONTH_YEAR column to the delivery_radius_log table
            SELECT DELIVERY_AREA_ID,
                   DELIVERY_RADIUS_METERS,
                   EVENT_STARTED_TIMESTAMP,
                   CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
                          YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
                   DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
                   DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
            FROM delivery_radius_log
        ) t  -- added alias here
        GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
    )
    
    SELECT ...
    -- your query here