Search code examples
sqlgoogle-bigqueryinner-jointemp-tables

How can I make a union of temporary tables with JOIN statement?


Through "WITH" statement I have obtained three temporary tables: "october_fall10", "november_fall11" and "december_fall12" Now I want to perform an INNER JOIN (the Query below) with them through the variable "member_casual", but instead I receive the following alert Error message: Table name "november_fall11" missing dataset while no default dataset is set in the request. Regardless that I wanted to do it with "december_fall12" the alert is repeated, but in reference to "december_fall12"

--This is Q1 from where I've got the 3 tables --

WITH october_fall10 AS
(SELECT 
   start_station_name,
   end_station_name,
   start_station_id,
   end_station_id,
   EXTRACT (DATE FROM started_at) AS start_date,
   EXTRACT(DAYOFWEEK FROM started_at) AS start_week_date,
   EXTRACT (TIME FROM started_at) AS start_time,    
   EXTRACT (DATE FROM ended_at) AS end_date,
   EXTRACT(DAYOFWEEK FROM ended_at) AS end_week_date,    
   EXTRACT (TIME FROM ended_at) AS end_time,
   DATETIME_DIFF (ended_at,started_at, MINUTE) AS total_lenght,
   member_casual
FROM 
   `ciclystic.cyclistic_seasonal_analysis.fall_202010` AS fall_analysis
ORDER BY 
   started_at DESC)
SELECT
   member_casual,
   start_week_date,
   COUNT (member_casual) AS member_casual_start,
   TIME(
   EXTRACT(hour   FROM AVG(start_time - '0:0:0')), 
   EXTRACT(minute FROM AVG(start_time - '0:0:0')), 
   EXTRACT(second FROM AVG(start_time - '0:0:0'))
   ) AS avg_start_time
FROM 
   october_fall10
GROUP BY
   start_week_date,
   member_casual
ORDER BY 
   start_week_date DESC;  
WITH november_fall11 AS
(SELECT
   start_station_name,
   end_station_name,
   start_station_id,
   end_station_id,
   EXTRACT (DATE FROM started_at) AS start_date,
   EXTRACT(DAYOFWEEK FROM started_at) AS start_week_date,
   EXTRACT (TIME FROM started_at) AS start_time,    
   EXTRACT (DATE FROM ended_at) AS end_date,
   EXTRACT(DAYOFWEEK FROM ended_at) AS end_week_date,    
   EXTRACT (TIME FROM ended_at) AS end_time,
   DATETIME_DIFF (ended_at,started_at, MINUTE) AS total_lenght,
   member_casual
FROM
   `ciclystic.cyclistic_seasonal_analysis.fall_202011` AS fall_analysis11 
ORDER BY 
   started_at DESC)
SELECT
   member_casual,
   start_week_date,
   COUNT (member_casual) AS member_casual_start,
   TIME(
   EXTRACT(hour   FROM AVG(start_time - '0:0:0')), 
   EXTRACT(minute FROM AVG(start_time - '0:0:0')), 
   EXTRACT(second FROM AVG(start_time - '0:0:0'))
   ) AS avg_start_time
FROM 
   november_fall11
GROUP BY
   start_week_date,
   member_casual
ORDER BY 
   start_week_date DESC;
WITH december_fall12 AS
(SELECT
   start_station_name,
   end_station_name,
   start_station_id,
   end_station_id,
   EXTRACT (DATE FROM started_at) AS start_date,
   EXTRACT(DAYOFWEEK FROM started_at) AS start_week_date,
   EXTRACT (TIME FROM started_at) AS start_time,    
   EXTRACT (DATE FROM ended_at) AS end_date,
   EXTRACT(DAYOFWEEK FROM ended_at) AS end_week_date,    
   EXTRACT (TIME FROM ended_at) AS end_time,
   DATETIME_DIFF (ended_at,started_at, MINUTE) AS total_lenght,
   member_casual
FROM
   `ciclystic.cyclistic_seasonal_analysis.fall_202012` AS fall_analysis11 
ORDER BY 
   started_at DESC)
SELECT
   member_casual,
   start_week_date,
   COUNT (member_casual) AS member_casual_start,
   TIME(
   EXTRACT(hour   FROM AVG(start_time - '0:0:0')), 
   EXTRACT(minute FROM AVG(start_time - '0:0:0')), 
   EXTRACT(second FROM AVG(start_time - '0:0:0'))
   ) AS avg_start_time
FROM 
   december_fall12
GROUP BY
   start_week_date,
   member_casual
ORDER BY 
   start_week_date DESC; 

--This is Q2 from where I want to get the full combination--

SELECT
       october_fall10.member_casual,
       october_fall10.start_week_date,
       october_fall10.member_casual_start,
       october_fall10.avg_start_time,
       november_fall11.member_casual,
       november_fall11.start_week_date,
       november_fall11.member_casual_start,
       november_fall11.avg_start_time,
       december_fall12.member_casual,
       december_fall12.start_week_date,
       december_fall12.member_casual_start,
       december_fall12.avg_start_time
   FROM
       ( october_fall10 JOIN (november_fall11 JOIN december_fall12 USING (member_casual)) 
       USING 
       (member_casual) )

Solution

  • First, when you create a 'table' using WITH, it's not a temporary table, it's a Common Table Expression (CTE). It's useful to understand the different terminology, as they behave in different ways (much like confusing petrol and diesel doesn't go so well).

    One of the ways that they differ is that a CTE is no persisted in any way. Once you make your final query, the CTE goes out of scope and can not be referenced again (after the ;, the CTE is 'gone').

    This means that the pattern you want is...

    WITH
      october_fall10 AS
    (
      SELECT 
        start_station_name,
        end_station_name,
        start_station_id,
        end_station_id,
        EXTRACT (DATE FROM started_at) AS start_date,
        EXTRACT(DAYOFWEEK FROM started_at) AS start_week_date,
        EXTRACT (TIME FROM started_at) AS start_time,    
        EXTRACT (DATE FROM ended_at) AS end_date,
        EXTRACT(DAYOFWEEK FROM ended_at) AS end_week_date,    
        EXTRACT (TIME FROM ended_at) AS end_time,
        DATETIME_DIFF (ended_at,started_at, MINUTE) AS total_lenght,
        member_casual
      FROM 
        `ciclystic.cyclistic_seasonal_analysis.fall_202010` AS fall_analysis
    ),
      october_fall10_aggregate AS
    (
      SELECT
         member_casual,
         start_week_date,
         COUNT(member_casual) AS member_casual_start,
         TIME(
           EXTRACT(hour   FROM AVG(start_time - '0:0:0')), 
           EXTRACT(minute FROM AVG(start_time - '0:0:0')), 
           EXTRACT(second FROM AVG(start_time - '0:0:0'))
         )
           AS avg_start_time
      FROM 
         october_fall10
      GROUP BY
         start_week_date,
         member_casual
    ),
      november_fall11 AS
    (
      <your query here>
    ),
      november_fall11_aggregate AS
    (
      <your query here>
    ),
      december_fall12 AS
    (
      <your query here>
    ),
      december_fall12_aggregate AS
    (
      <your query here>
    ),
    SELECT
      october_fall10_aggregate.member_casual,
      october_fall10_aggregate.start_week_date,
      october_fall10_aggregate.member_casual_start,
      october_fall10_aggregate.avg_start_time,
      november_fall11_aggregate.member_casual,
      november_fall11_aggregate.start_week_date,
      november_fall11_aggregate.member_casual_start,
      november_fall11_aggregate.avg_start_time,
      december_fall12_aggregate.member_casual,
      december_fall12_aggregate.start_week_date,
      december_fall12_aggregate.member_casual_start,
      december_fall12_aggregate.avg_start_time
    FROM
      october_fall10_aggregate
    JOIN
      november_fall11_aggregate
        USING (member_casual)
    JOIN
      december_fall12_aggregate
        USING (member_casual)
    

    This creates new CTEs with a _aggregate suffix, and then references those in your final query.

    • Note that I also removed the ORDER BY from each CTE
    • This is because CTEs (and table, temporary or otherwise) do not retain that ordering

    Also, repeating all that code is insane. You're much better off with all the data in a single table and writing the code once. Failing that you can union the tables 'as-if' they're one table...

    WITH
      unioned AS
    (
      SELECT 202010 AS source_tbl, * FROM `ciclystic.cyclistic_seasonal_analysis.fall_202010`
      UNION ALL
      SELECT 202011 AS source_tbl, * FROM `ciclystic.cyclistic_seasonal_analysis.fall_202011`
      UNION ALL
      SELECT 202012 AS source_tbl, * FROM `ciclystic.cyclistic_seasonal_analysis.fall_202012`
    ),
      extraction AS
    (
      SELECT 
        source_tbl,
        member_casual,
        start_station_name,
        end_station_name,
        start_station_id,
        end_station_id,
        EXTRACT (DATE FROM started_at) AS start_date,
        EXTRACT(DAYOFWEEK FROM started_at) AS start_week_date,
        EXTRACT (TIME FROM started_at) AS start_time,    
        EXTRACT (DATE FROM ended_at) AS end_date,
        EXTRACT(DAYOFWEEK FROM ended_at) AS end_week_date,    
        EXTRACT (TIME FROM ended_at) AS end_time,
        DATETIME_DIFF (ended_at,started_at, MINUTE) AS total_length
      FROM 
        unioned
    )
    SELECT
      member_casual,
      source_tbl,
      start_week_date,
      COUNT(member_casual) AS member_casual_start,
      TIME(
        EXTRACT(hour   FROM AVG(start_time - '0:0:0')), 
        EXTRACT(minute FROM AVG(start_time - '0:0:0')), 
        EXTRACT(second FROM AVG(start_time - '0:0:0'))
      )
        AS avg_start_time
    FROM 
      extraction
    GROUP BY
      member_casual,
      source_tbl,
      start_week_date
    ORDER BY
      member_casual,
      source_tbl,
      start_week_date
    

    This is NOT the same format as you had before, but is much more in keeping with the intended patterns for SQL.