Tags: python, web-scraping, praw, reddit

Python PRAW: get comments and write them to a file


I'm scraping a subreddit's contents with PRAW. The subreddit is AR. I need to get the post ID, title, post content, author, post date, score, comments, and comment IDs, then write them to a txt file. The problems I'm facing are:

(1) Can I combine the comments and comment IDs into the same file as the post data? Each record would then contain the post ID, title, post content, author, post date, score, comments, and comment IDs.

(2) The selftext I get contains line breaks, so my output.txt looks like this:

blablabla

blablabla

blablabla

For example, [this Reddit post][1] has multiple line breaks. I want each post's content on a single line because the data will be imported into CSV/Excel for later analysis.

My code:

import praw, datetime, os
reddit = praw.Reddit('bot1')
subreddit = reddit.subreddit('AR')
for submission in subreddit.top(limit=1):
    date = datetime.datetime.utcfromtimestamp(submission.created_utc)

    for comment in submission.comments:
        print("Comment author: ", comment.author)
        print("Comments: ", comment.body)
        # Use a with-block so the file is closed after each write
        with open('path' + 'index_comments.txt', 'a+') as indexFile_comment:
            indexFile_comment.write('"' + str(comment.author) + '"' + ', ' + '"' + str(comment.body) + '"' + '\n')
    print("Post ID: ", submission.id)
    print("Title: ", submission.title)
    print("Post Content: ", submission.selftext)
    print("User Name: ", submission.author)
    print("Post Date: ", date)
    print("Point: ", submission.score)
    indexFile = open('path' + 'index.txt', 'a+')
    indexFile.write('"' + str(submission.id) + '"' + ', ' + '"' + str(submission.title) + '"' + ', ' + '"' + str(submission.selftext) + '"' + ', ' + '"' + str(submission.author) + '"' + ', ' + '"' + str(date) + '"' + ', ' + '"' + str(submission.score) + '"' + '\n')
    print("Successfully wrote to file")
    indexFile.close()

Solution

  • To get the submission text on one line, call st.replace("\n", " "), where st is submission.selftext. Note that replace returns a new string rather than changing st in place, so assign the result back to a variable.

    To get the comment ID, use comment.id; to get the comment text, use comment.body. Both are available inside your for loop.

    Edit:

    In the first line, I have only added submission.id and submission.title, but you can add the other fields in the same way. The loop appends each comment to the end of the same string. After the for loop, I replace any newline characters with a space. You can then write the record to a text file, and when you move on to the next submission, append the next record on a new line of the file.

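    record = str(submission.id) + " " + str(submission.title) + " "
    # Drop "load more comments" placeholders, which have no body attribute
    submission.comments.replace_more(limit=0)
    for comment in submission.comments:
        # comment.author is a Redditor object (or None), so convert it with str()
        record = record + comment.id + " " + str(comment.author) + " " + comment.body + " "
    # str.replace returns a new string, so assign the result back
    record = record.replace("\n", " ")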
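
Since the data is headed for CSV/Excel, one possible alternative to joining fields by hand is Python's csv module, which takes care of quoting. The following is only a sketch under some assumptions: the helper names flatten and write_rows are made up for illustration, one row is written per comment with the post fields repeated on each row, and SimpleNamespace objects stand in for real PRAW submissions and comments so the example runs without Reddit credentials.

```python
import csv
import io
from types import SimpleNamespace


def flatten(text):
    """Collapse every run of whitespace, including line breaks, into one space."""
    return " ".join(str(text).split())


def write_rows(out, submission, comments):
    """Write one CSV row per comment, repeating the post fields on each row.

    Hypothetical helper: `out` is any writable text file object, and
    `submission` / `comments` only need the attributes used below.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for comment in comments:
        writer.writerow([
            submission.id,
            flatten(submission.title),
            flatten(submission.selftext),
            str(submission.author),
            submission.score,
            comment.id,
            str(comment.author),
            flatten(comment.body),
        ])


# Stand-in objects (not real PRAW objects) so the sketch is runnable offline.
post = SimpleNamespace(id="abc123", title="Hello",
                       selftext="line one\n\nline two",
                       author="alice", score=5)
replies = [SimpleNamespace(id="c1", author="bob", body="first\nreply")]

buffer = io.StringIO()
write_rows(buffer, post, replies)
print(buffer.getvalue())
```

With real PRAW objects, you would probably call submission.comments.replace_more(limit=0) first, pass submission.comments.list() as the comments argument, and open the output file with newline="" as the csv module recommends. Note that in this sketch a post with no comments produces no rows, and other fields such as the post date can be added to the row in the same way.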