Tags: python, for-loop, beautifulsoup, web-crawler, reddit

Python - Reddit web crawler using BeautifulSoup4 returns nothing


I've attempted to create a web crawler for Reddit's /r/all that gathers the links of the top posts. I have been following part one of thenewboston's web crawler tutorial series on YouTube.

In my code, I've removed the while loop that limits the number of pages to crawl in thenewboston's version, since I'm only going to crawl the top 25 posts of /r/all, which is a single page. I made these changes to suit the purpose of my web crawler.

In my code, I've changed the URL variable to 'http://www.reddit.com/r/all/' (for obvious reasons) and the Soup.findAll iterable to Soup.findAll('a', {'class': 'title may-blank loggedin'}) (title may-blank loggedin is the class of a title of a post on Reddit).

Here is my code:

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all/'
    sourceCode = requests.get(URL)
    plainText = sourceCode.text
    Soup = BeautifulSoup(plainText)
    for link in Soup.findAll('a', {'class': 'title may-blank loggedin'}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()

I've done some amateur bug-checking using print statements between each line and it seems that the for loop is not being executed.

To follow along or compare thenewboston's code with mine, skip to part two in his mini-series and find a spot in his video where his code is shown.

EDIT: thenewboston's code on request:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://buckysroom.org/trade/search.php?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = 'http://buckysroom.org' + link.get('href')
            print(href)
        page += 1

trade_spider(1)

Solution

  • First of all, thenewboston's tutorial is a screencast, so including that code in the question would be helpful.

    Secondly, I would recommend saving the page to a local file so you can open it in a browser and use the developer tools to find the elements you want. I would also recommend using ipython to play around with BeautifulSoup on the local file rather than scraping the site every time.

    If you throw this in there you can accomplish that:

    plainText = sourceCode.text
    # Save a local copy of the page (text mode with an explicit encoding, Python 3).
    with open('something.html', 'w', encoding='utf8') as f:
        f.write(plainText)
    

    When I ran your code, I first had to wait, because several times it returned an error page saying that I was requesting too often. That could be your first problem.
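
If the error page is caused by rate limiting, one option is to retry after a pause. The sketch below is only an illustration: fetch_with_retry, its parameters, the delay values, and the assumption that the server signals throttling with HTTP status 429 (Too Many Requests) are mine, not something confirmed in the question. The fetching function is passed in as an argument so the logic can be tried without touching the network.

```python
import time


def fetch_with_retry(get, url, headers=None, attempts=3, delay=2.0, sleep=time.sleep):
    """Call get(url, headers=...) until the status is not 429 or attempts run out.

    `get` is any callable shaped like requests.get; the response object only
    needs a `status_code` attribute. All names here are hypothetical.
    """
    response = None
    for attempt in range(attempts):
        response = get(url, headers=headers)
        if response.status_code != 429:
            # Anything other than "too many requests" is returned as-is.
            return response
        if attempt < attempts - 1:
            # Wait longer after each throttled attempt: delay, 2*delay, 4*delay, ...
            sleep(delay * (2 ** attempt))
    # Still throttled after the last attempt; hand back the last response.
    return response
```

With requests this might be called as fetch_with_retry(requests.get, URL, headers={'User-Agent': 'my-crawler/0.1'}). The User-Agent string is a made-up placeholder, and whether a descriptive one reduces throttling here is an assumption.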

    When I did get the page, there were plenty of links but none with your classes. I'm not sure what 'title may-blank loggedin' is supposed to represent without watching that entire YouTube series.
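
One way to see which classes the links on the page actually carry is to count them. This is a rough sketch: anchor_classes is a made-up helper name, and it assumes the page HTML is already available as a string (for example, read back from a locally saved copy).

```python
from collections import Counter

from bs4 import BeautifulSoup


def anchor_classes(html):
    """Count the class attribute values found on <a> tags in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for a in soup.find_all("a"):
        # BeautifulSoup exposes class as a list of names, or None if absent.
        classes = a.get("class")
        if classes:
            counts[" ".join(classes)] += 1
    return counts
```

Calling anchor_classes(open('something.html', encoding='utf8').read()).most_common() in ipython lists the class combinations in the saved page, which makes it easy to check whether the one being searched for is present at all.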

    Now I see the problem.

    It's the loggedin class: your scraper is not logged in, so the page it receives has no links with that class and the for loop has nothing to iterate over.

    You shouldn't need to log in just to see /r/all, so use this instead:

    Soup.findAll('a', {'class': 'title may-blank '})
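
A variation that does not depend on the exact class string is to match the individual classes with a CSS selector, so the same code finds title links whether or not loggedin is present and regardless of stray whitespace in the attribute. This is a sketch on made-up markup, not Reddit's actual HTML; title_links, SAMPLE_HTML, and BASE_URL are hypothetical names. It also uses urljoin so that absolute hrefs are left alone, which is my addition rather than part of the original code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup that imitates logged-out and logged-in title links.
SAMPLE_HTML = """
<a class="title may-blank " href="/r/pics/comments/abc/example_post/">Example post</a>
<a class="title may-blank loggedin" href="http://example.com/story">External story</a>
<a class="comments may-blank" href="/r/pics/comments/abc/">12 comments</a>
"""

BASE_URL = "http://www.reddit.com/r/all/"


def title_links(html, base_url=BASE_URL):
    """Return absolute URLs for every <a> that has both 'title' and 'may-blank'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # a.title.may-blank matches on each class separately, in any order,
    # and ignores any additional classes such as loggedin.
    for a in soup.select("a.title.may-blank"):
        href = a.get("href")
        if href:
            # urljoin resolves relative hrefs and keeps absolute ones unchanged.
            links.append(urljoin(base_url, href))
    return links


if __name__ == "__main__":
    for link in title_links(SAMPLE_HTML):
        print(link)
```

The comments link in the sample is skipped because it lacks the title class, while both title links are returned.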