Search code examples
pythonpython-3.xredditpraw

Reddit PRAW API: Extracting entire JSON format


I am working on Sentiment Analysis using The Reddit API Praw. My code is below:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import praw
from IPython import display
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from pprint import pprint
import pandas as pd
import nltk
import seaborn as sns
import datetime

sns.set(style='darkgrid', context='talk', palette='Dark2')

reddit = praw.Reddit(client_id='XXXXXXXXXXX',
                     client_secret='XXXXXXXXXXXXXXXXXXX',
                     user_agent='StackOverflow')

headlines = set()
results = []
sia = SIA()

for submission in reddit.subreddit('bitcoin').new(limit=None):
    pol_score = sia.polarity_scores(submission.title)
    pol_score['headline'] = submission.title
    readable = datetime.datetime.fromtimestamp(submission.created_utc).isoformat()
    results.append((submission.title, readable, pol_score["compound"]))
    display.clear_output()

Question A: With this code I can extract only the title of the text and so other few keys. I would like to extract everything in JSON format, but studying the documentation I haven't seen if it is possible.

If I call only submission in reddit.subreddit('bitcoin') It turn out only the id code. I would like to exctract everything, any information and save it in a JSON file.

Question B: How could I extract comments/messages from a specific day?


Solution

  • Question A:

    You could simply add a .json at the end of the full url of the post to get the full Json for that page which includes title, author, comments, votes and everything else.

    Once you get the full url of the post using submission.permalink. You could use requests to get the Json for that page.

    import requests
    
    url = submission.permalink
    response = requests.get('http' + url + '.json') 
    json = response.content # your Json
    

    Question B:

    Unfortunately, Reddit removed timestamp search from their search api sometime last year. Here's an announcement post about it.

    Besides some minor syntax differences, the most notable change is that searches by exact timestamp are no longer supported on the newer system. Limiting results to the past hour, day, week, month and year is still supported via the ?t= parameter (e.g. ?t=day)

    So, there's currently no way of doing this using Praw. But you could look into Pushshift api which provides this functionality.