I am working on sentiment analysis using PRAW, the Python Reddit API wrapper. My code is below:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import praw
from IPython import display
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from pprint import pprint
import pandas as pd
import nltk
import seaborn as sns
import datetime
sns.set(style='darkgrid', context='talk', palette='Dark2')
reddit = praw.Reddit(client_id='XXXXXXXXXXX',
                     client_secret='XXXXXXXXXXXXXXXXXXX',
                     user_agent='StackOverflow')
headlines = set()
results = []
sia = SIA()
for submission in reddit.subreddit('bitcoin').new(limit=None):
    pol_score = sia.polarity_scores(submission.title)
    pol_score['headline'] = submission.title
    readable = datetime.datetime.fromtimestamp(submission.created_utc).isoformat()
    results.append((submission.title, readable, pol_score["compound"]))
    display.clear_output()
Question A: With this code I can extract only the title and a few other keys. I would like to extract everything in JSON format, but after studying the documentation I could not tell whether that is possible.
If I print only submission from reddit.subreddit('bitcoin'), it returns just the ID. I would like to extract everything, every piece of information, and save it to a JSON file.
Question B: How could I extract comments/messages from a specific day?
Question A:
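You could simply append .json to the full URL of the post to get the full JSON for that page, which includes the title, author, comments, votes and everything else.
submission.permalink gives the post's path relative to https://www.reddit.com. Once you have built the full URL from it, you could use requests to get the JSON for that page.
import requests
# permalink is a path relative to the site root, e.g. /r/.../comments/...
url = 'https://www.reddit.com' + submission.permalink + '.json'
# Reddit tends to reject the default requests User-Agent, so send a descriptive one
response = requests.get(url, headers={'User-Agent': 'StackOverflow'})
data = response.json()  # parsed JSON for the whole page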
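If you would rather stay inside PRAW, the attributes already loaded on a submission can be read with vars(submission) and written out as JSON. The following is only a minimal sketch under some assumptions: submission_to_dict and save_submissions are hypothetical helper names, private attributes (names starting with an underscore) are skipped, and any value that json cannot serialize is converted with str(). A stand-in object is used so the sketch runs without Reddit credentials.

```python
import json
from types import SimpleNamespace


def submission_to_dict(submission):
    """Return the public attributes of a submission-like object as JSON-safe data."""
    record = {}
    for key, value in vars(submission).items():
        if key.startswith('_'):
            continue  # skip private fields such as the internal Reddit client
        try:
            json.dumps(value)  # probe: is this value serializable as-is?
            record[key] = value
        except (TypeError, ValueError):
            record[key] = str(value)  # e.g. author and subreddit objects
    return record


def save_submissions(submissions, path):
    """Write one JSON array holding every submission's attributes."""
    records = [submission_to_dict(s) for s in submissions]
    with open(path, 'w', encoding='utf-8') as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return records


# Stand-in object (an assumption for illustration, not a real PRAW submission).
fake = SimpleNamespace(id='abc123', title='Example', score=5,
                       author=object(), _reddit=None)
example = submission_to_dict(fake)
```

With PRAW, this could be called as save_submissions(reddit.subreddit('bitcoin').new(limit=None), 'bitcoin.json'). Note that it saves only what the listing returned for each post, so comments would still need to be fetched separately.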
Question B:
Unfortunately, Reddit removed timestamp search from its search API sometime last year. Here's an announcement post about it.
Besides some minor syntax differences, the most notable change is that searches by exact timestamp are no longer supported on the newer system. Limiting results to the past hour, day, week, month and year is still supported via the ?t= parameter (e.g. ?t=day)
So there is currently no way of doing this using PRAW. But you could look into the Pushshift API, which provides this functionality.
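For recent days, one workaround that stays within PRAW is to walk a listing such as .new() and filter client-side on created_utc. This is only a minimal sketch, assuming the day is defined in UTC and that the posts of interest are still inside the listing (Reddit listings generally return only about the newest 1000 items, so older days would not be reachable this way). day_bounds_utc and submissions_on_day are hypothetical helper names, and stand-in objects are used so the sketch runs without Reddit credentials.

```python
import datetime
from types import SimpleNamespace


def day_bounds_utc(day):
    """Return (start, end) Unix timestamps covering one UTC calendar day."""
    start = datetime.datetime(day.year, day.month, day.day,
                              tzinfo=datetime.timezone.utc)
    end = start + datetime.timedelta(days=1)
    return start.timestamp(), end.timestamp()


def submissions_on_day(submissions, day):
    """Keep only the items whose created_utc falls inside the given UTC day."""
    start, end = day_bounds_utc(day)
    return [s for s in submissions if start <= s.created_utc < end]


# Stand-in objects (assumptions for illustration, not real PRAW submissions).
sample = [
    SimpleNamespace(title='in range', created_utc=1546344000.0),  # 2019-01-01 12:00 UTC
    SimpleNamespace(title='too late', created_utc=1546430400.0),  # 2019-01-02 12:00 UTC
]
matches = submissions_on_day(sample, datetime.date(2019, 1, 1))
```

With PRAW, the first argument would be the listing itself, for example submissions_on_day(reddit.subreddit('bitcoin').new(limit=None), datetime.date(2019, 1, 1)).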