Tags: python, web-scraping, python-requests, reddit

Scraping subreddit top posts of all time using requests is returning the wrong result


I would like to scrape a subreddit for its top posts of all time. I know the PRAW module may work better, but I would prefer to scrape using only requests for now.

import requests

url = "https://www.reddit.com/r/shittysuperpowers/top/?t=all.html"
headers = {"User-agent": "bot_0.1"}
res = requests.get(url, headers=headers)

res.status_code returned 200, so the request itself succeeded. But closer inspection of res.text revealed that the HTML scraped is not from the desired page. What came back was the top posts of today rather than of all time, i.e. the same content as https://www.reddit.com/r/shittysuperpowers/top/?t=day.html. Is there any reason why I am unable to scrape the top posts of all time? I have tried this with other subreddits too, and they all run into the same problem.


Solution

  • Append .json to the path, before the query string, to get the data in JSON format.

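    import requests

    # .json goes on the path; sort order and time range go in the query string
    url = 'https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all'
    resp = requests.get(url, headers={'User-agent': 'bot_0.1'})
    if resp.ok:
        data = resp.json()  # the listing, parsed into a Python dict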
    

    The .json suffix also works in a browser.
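
Once the JSON is parsed, the posts still have to be pulled out of it. The sketch below assumes the usual Reddit listing layout, where posts sit under data -> children and each child carries its fields (such as title and score) in its own data dict. The helper name extract_posts and the sample listing are made up for illustration and are not part of the answer above.

```python
def extract_posts(listing):
    """Return (title, score) pairs from a parsed Reddit listing dict.

    Assumes the layout {"data": {"children": [{"data": {...}}, ...]}}.
    A dict without those keys yields an empty list.
    """
    children = listing.get("data", {}).get("children", [])
    return [(child["data"]["title"], child["data"]["score"]) for child in children]


# Made-up sample that mimics the assumed listing shape (not real Reddit data)
sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "Example post A", "score": 123}},
            {"kind": "t3", "data": {"title": "Example post B", "score": 45}},
        ]
    },
}

for title, score in extract_posts(sample_listing):
    print(score, title)
```

With the data variable from the snippet above, extract_posts(data) should give the same kind of list for the real page, provided the response follows the assumed layout.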