Search code examples
google-bigqueryreddit

Having trouble joining multiple Reddit tables with an AS and ON clause


I'm trying to join the comments to the posts for multiple tables. I need an AS clause because the posts table and comments table share a column 'score'.

My goal is to be able to find the top comments within the top posts with the data in all these tables.

#standardSQL
SELECT posts.title, posts.url, posts.score AS postsscore, 
DATE_TRUNC(DATE(TIMESTAMP_SECONDS(posts.created_utc)), MONTH), 
comments.body, comments.score AS commentsscore, comments.id

FROM

fh-bigquery.reddit_posts.2015_12, fh-bigquery.reddit_posts.2016_01, fh-bigquery.reddit_posts.2016_02, fh-bigquery.reddit_posts.2016_03, fh-bigquery.reddit_posts.2016_04, fh-bigquery.reddit_posts.2016_05, fh-bigquery.reddit_posts.2016_06, fh-bigquery.reddit_posts.2016_07, fh-bigquery.reddit_posts.2016_08, fh-bigquery.reddit_posts.2016_09, fh-bigquery.reddit_posts.2016_10, fh-bigquery.reddit_posts.2016_11, fh-bigquery.reddit_posts.2016_12, fh-bigquery.reddit_posts.2017_01, fh-bigquery.reddit_posts.2017_02, fh-bigquery.reddit_posts.2017_03, fh-bigquery.reddit_posts.2017_04, fh-bigquery.reddit_posts.2017_05, fh-bigquery.reddit_posts.2017_06, fh-bigquery.reddit_posts.2017_07, fh-bigquery.reddit_posts.2017_08, fh-bigquery.reddit_posts.2017_09, fh-bigquery.reddit_posts.2017_10, fh-bigquery.reddit_posts.2017_11, fh-bigquery.reddit_posts.2017_12, fh-bigquery.reddit_posts.2018_01, fh-bigquery.reddit_posts.2018_02, fh-bigquery.reddit_posts.2018_03, fh-bigquery.reddit_posts.2018_04, fh-bigquery.reddit_posts.2018_05, fh-bigquery.reddit_posts.2018_06, fh-bigquery.reddit_posts.2018_07, fh-bigquery.reddit_posts.2018_08, fh-bigquery.reddit_posts.2018_09, fh-bigquery.reddit_posts.2018_10

AS posts

JOIN

fh-bigquery.reddit_comments.2015_12, fh-bigquery.reddit_comments.2016_01, fh-bigquery.reddit_comments.2016_02, fh-bigquery.reddit_comments.2016_03, fh-bigquery.reddit_comments.2016_04, fh-bigquery.reddit_comments.2016_05, fh-bigquery.reddit_comments.2016_06, fh-bigquery.reddit_comments.2016_07, fh-bigquery.reddit_comments.2016_08, fh-bigquery.reddit_comments.2016_09, fh-bigquery.reddit_comments.2016_10, fh-bigquery.reddit_comments.2016_11, fh-bigquery.reddit_comments.2016_12, fh-bigquery.reddit_comments.2017_01, fh-bigquery.reddit_comments.2017_02, fh-bigquery.reddit_comments.2017_03, fh-bigquery.reddit_comments.2017_04, fh-bigquery.reddit_comments.2017_05, fh-bigquery.reddit_comments.2017_06, fh-bigquery.reddit_comments.2017_07, fh-bigquery.reddit_comments.2017_08, fh-bigquery.reddit_comments.2017_09, fh-bigquery.reddit_comments.2017_10, fh-bigquery.reddit_comments.2017_11, fh-bigquery.reddit_comments.2017_12, fh-bigquery.reddit_comments.2018_01, fh-bigquery.reddit_comments.2018_02, fh-bigquery.reddit_comments.2018_03, fh-bigquery.reddit_comments.2018_04, fh-bigquery.reddit_comments.2018_05, fh-bigquery.reddit_comments.2018_06, fh-bigquery.reddit_comments.2018_07, fh-bigquery.reddit_comments.2018_08, fh-bigquery.reddit_comments.2018_09, fh-bigquery.reddit_comments.2018_10

AS comments

ON posts.id = SUBSTR(comments.link_id, 4)

WHERE posts.subreddit = 'Showerthoughts' AND posts.score >100 AND comments.score >100
ORDER BY posts.score DESC

My goal is to be able to find the top comments within the top posts with the data in all these tables.


Solution

  • Ok, so the issues with this query:

    • Be careful! This query will process a lot of data. I could re-cluster the tables to make this way more efficient, but I haven't yet.
    • In #standardSQL a comma means JOIN, not UNION. So you would need to UNION the tables.
    • Shortcut: You can append a * at the end of a table name to expand to all matching tables.
    • Use a backtick to escape the table names.

    With that said, a working query would be:

    #standardSQL
    SELECT posts.title, posts.url, posts.score AS postsscore, 
    DATE_TRUNC(DATE(TIMESTAMP_SECONDS(posts.created_utc)), MONTH), 
    SUBSTR(comments.body, 0, 80), comments.score AS commentsscore, comments.id
    
    FROM `fh-bigquery.reddit_posts.2015*` AS posts
    JOIN `fh-bigquery.reddit_comments.2015*` AS comments
    
    ON posts.id = SUBSTR(comments.link_id, 4)
    
    WHERE posts.subreddit = 'Showerthoughts' 
    AND posts.score >100 
    AND comments.score >100
    ORDER BY posts.score DESC