I'm trying to collect the (let's say top 20) most upvoted comments within the most upvoted posts of a given subreddit.
Any help would be appreciated!
I've gotten to this code which I use in bigquery, but I can't seem to get the post score and the comment score without a duplicate issue.
SELECT posts.title, posts.score, comments.body, posts.subreddit
FROM `fh-bigquery.reddit_comments.2018_10` AS comments
JOIN `fh-bigquery.reddit_posts.2018_10` AS posts
ON posts.id = SUBSTR(comments.link_id, 4)
WHERE posts.subreddit = 'Showerthoughts'
For a simplified example I would like to be able to see:
Post Title 1 | Post Score | (Within Post Title 1) Comment Body 1 | Comment Score
Post Title 1 | Post Score | (Within Post Title 1) Comment Body 2 | Comment Score
Post Title 2 | Post Score | (Within Post Title 2) Comment Body 1 | Comment Score
Post Title 2 | Post Score | (Within Post Title 2) Comment Body 2 | Comment Score
Here is a quick way of getting around the issue of duplicate text blobs:
select title, score, body, subreddit from (
SELECT
to_hex(md5(posts.title)),
array_agg(posts.title)[offset(0)] as title,
array_agg(comments.body)[offset(0)] as body,
array_agg(posts.score)[offset(0)] as score,
array_agg(posts.subreddit)[offset(0)] as subreddit
FROM `fh-bigquery.reddit_comments.2018_10` AS comments
JOIN `fh-bigquery.reddit_posts.2018_10` AS posts
ON posts.id = SUBSTR(comments.link_id, 4)
WHERE posts.subreddit = 'Showerthoughts'
group by 1
order by 1
)
The idea is to convert that expensive text blob into a md5 hash, and then going about your usual business with the unique entries. You can sort stuff the way you want out of these distinct values.