python reddit praw

Fetching Reddit data using PRAW into JSON Lines


I'm trying to fetch Reddit post data using PRAW and write it to a JSON Lines file.

What I need is something like this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?"], "response": ["My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "response": ["No, it's still in the game. Use the debug stick to set all sides to `none`"], "id": "gabsj3"}

So context contains ["POST TITLE", "FIRST LEVEL COMMENT", "SECOND LEVEL COMMENT", "ETC..."] and response contains the comment one level below the last context entry. For this post on Reddit, the final line should be:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in", "No, it's still in the game. Use the debug stick to set all sides to `none`"], "response": ["Huh, alright"], "id": "gabsj3"}

But the output of my code is something like this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?", "I think we can still use resource packs to change it back into a dot, I don't know so don't quote me on that", "I honestly think the cross redstone looks a bit more like a splatter."], "id": "gabsj3"}

Here's my code:

import praw
import jsonlines

reddit = praw.Reddit(client_id='-', client_secret='-', user_agent='user_agent')

max = 1000
sequence =1
for post in reddit.subreddit('minecraft').new(limit=max):
    data = []
    title = []
    comment = []
    response = []
    post_id = post.id
    titl = post.title
    # print("https://www.reddit.com/"+post.permalink)

    print("Fetched "+str(sequence) + " posts .. ")
    title.append(titl)
    try:
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=None)
        sequence = sequence + 1

        for top_level_comment in submission.comments:
            cmnt_body = top_level_comment.body
            comment.append(cmnt_body)
            for second_level_comment in top_level_comment.replies:
                response.append(second_level_comment.body)
            context = [title[0],comment[0]]
            data.append({"context":context,"response":response,"id":post_id})
            response = []
            # print(data[0])
            with jsonlines.open('2020-04-30_12.jsonl', mode='a') as writer:
                writer.write(data.pop())
            comment.pop()
        title.pop()


    except Exception :
        pass

Solution

  • This is an interesting way to want to store the data. I can't say that I'd use this approach myself, since it involves duplicating the same information over and over again.

    To achieve this, you'll need to manage a stack containing the current context, and use recursion to get each comment's children:

    import jsonlines
    import praw
    
    reddit = praw.Reddit(...)  # fill in with your authentication
    
    
    def main():
        for post in reddit.subreddit("minecraft").new(limit=1000):
            dump_replies(replies=post.comments, context=[post.title])
    
    
    def dump_replies(replies, context):
        for reply in replies:
            if isinstance(reply, praw.models.MoreComments):
                continue
    
            reply_data = {
                "context": context,
                # Wrapped in a list to match the desired output format.
                "response": [reply.body],
                "id": reply.submission.id,
            }
            with jsonlines.open("2020-04-30_12.jsonl", mode="a") as writer:
                writer.write(reply_data)
    
            context.append(reply.body)
            dump_replies(reply.replies, context)
            context.pop()
    
    
    main()
    

    Before each recursive call, we append the current item's body to the context list, and then we remove it after recursing. This builds up a stack that shows the path down to the current comment. Then for every comment, we dump its context, its body, and its submission ID.

    Note that this won't dump anything for a post that has no comments, which seems to be in line with the strategy in your example data (as every line represents a comment that is a reply to something else).
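
To see the push/pop pattern on its own, here is a minimal sketch that runs without Reddit credentials. It assumes comments are plain dicts with "body" and "replies" keys, which is a stand-in for PRAW comment objects and not PRAW's API. The names iter_records and thread are hypothetical, and the sample data is made up.

```python
import json


def iter_records(replies, context, post_id):
    """Yield one record per comment, walking the tree depth first."""
    for reply in replies:
        # Copy the context: the list is mutated below, and the caller may
        # hold on to the yielded records instead of writing them right away.
        yield {"context": list(context), "response": [reply["body"]], "id": post_id}
        context.append(reply["body"])  # push: this comment becomes context
        yield from iter_records(reply["replies"], context, post_id)
        context.pop()  # pop: restore the parent's context


# Hypothetical thread: one top-level comment with two direct replies.
thread = [
    {"body": "first", "replies": [
        {"body": "second", "replies": []},
        {"body": "sibling", "replies": []},
    ]},
]

records = list(iter_records(thread, ["title"], "abc123"))
lines = [json.dumps(record) for record in records]  # one JSON Lines row each
```

Both "second" and "sibling" end up with the context ["title", "first"], because the pop after each recursive call removes whatever the previous sibling pushed. The copy matters here only because the records are collected into a list; the version above writes each record to disk immediately, so it can pass the shared list directly.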