I am downloading the top 100 posts from a subreddit. However, many of them are external links, JPG files, or other non-text content, so I end up with a list made up mostly of empty strings. Is there a way to retrieve only the entries that contain selftext? Here is my code:
import json
import nltk
import re
import pandas

appended_data = []
subreddit = reddit.subreddit('bitcoin')
top_python = subreddit.hot(limit=100)
for submission in top_python:
    if not submission.stickied:
        appended_data.append(submission.selftext)
str_list = list(filter(None, appended_data))
There is a built-in flag for checking whether a submission is a text post: is_self. The updated version of your code would look something like this:
import json
import nltk
import re
import pandas
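
appended_data = []
# `reddit` is assumed to be a praw.Reddit instance created elsewhere
subreddit = reddit.subreddit('bitcoin')
top_python = subreddit.hot(limit=100)
for submission in top_python:
    # Keep only text (self) posts that are not stickied
    if not submission.stickied and submission.is_self:
        appended_data.append(submission.selftext)
# A text post can still have an empty body, so drop empty strings
str_list = list(filter(None, appended_data))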
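If you want to check the filtering logic without hitting the Reddit API, here is a minimal sketch. The helper name `collect_selftext` and the `fake_posts` stand-ins are my own inventions for illustration, not part of PRAW. The only assumption is that each object exposes the `is_self`, `stickied` and `selftext` attributes, as PRAW submissions do.

```python
from types import SimpleNamespace


def collect_selftext(submissions):
    """Return the non-empty selftext of non-stickied text posts."""
    return [
        s.selftext
        for s in submissions
        if s.is_self and not s.stickied and s.selftext
    ]


# Hypothetical stand-ins for PRAW submissions, for illustration only
fake_posts = [
    SimpleNamespace(is_self=True, stickied=False, selftext="Some text"),
    SimpleNamespace(is_self=False, stickied=False, selftext=""),     # link post
    SimpleNamespace(is_self=True, stickied=True, selftext="Pinned"),  # stickied
    SimpleNamespace(is_self=True, stickied=False, selftext=""),       # title only
]

texts = collect_selftext(fake_posts)
print(texts)
```

With a real PRAW instance you should be able to pass the listing straight in, e.g. `collect_selftext(reddit.subreddit('bitcoin').hot(limit=100))`.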
If you have any further questions, don't hesitate to post a comment and ask!