Search code examples
google-bigqueryreddit

Need help querying top reddit comments by top posts


I'm trying to collect the (let's say top 20) most upvoted comments within the most upvoted posts of a given subreddit.

Any help would be appreciated!

I've gotten to this code which I use in bigquery, but I can't seem to get the post score and the comment score without a duplicate issue.

SELECT posts.title, posts.score, comments.body, posts.subreddit
FROM `fh-bigquery.reddit_comments.2018_10` AS comments
JOIN `fh-bigquery.reddit_posts.2018_10`  AS posts
ON posts.id = SUBSTR(comments.link_id, 4) 
WHERE posts.subreddit = 'Showerthoughts'

For a simplified example I would like to be able to see:

Post Title 1 | Post Score | (Within Post Title 1) Comment Body 1 | Comment Score

Post Title 1 | Post Score | (Within Post Title 1) Comment Body 2 | Comment Score

Post Title 2 | Post Score | (Within Post Title 2) Comment Body 1 | Comment Score

Post Title 2 | Post Score | (Within Post Title 2) Comment Body 2 | Comment Score


Solution

  • Here is a quick way of getting around the issue of duplicate text blobs:

    select title, score, body, subreddit from (
        SELECT 
            to_hex(md5(posts.title)), 
            array_agg(posts.title)[offset(0)] as title, 
            array_agg(comments.body)[offset(0)] as body, 
            array_agg(posts.score)[offset(0)] as score, 
            array_agg(posts.subreddit)[offset(0)] as subreddit
        FROM `fh-bigquery.reddit_comments.2018_10` AS comments
        JOIN `fh-bigquery.reddit_posts.2018_10`  AS posts
        ON posts.id = SUBSTR(comments.link_id, 4) 
        WHERE posts.subreddit = 'Showerthoughts'
        group by 1
        order by 1
    )
    

    The idea is to convert that expensive text blob into a md5 hash, and then going about your usual business with the unique entries. You can sort stuff the way you want out of these distinct values.