python, screen-scraping, urllib

Check if Python urlopen has finished loading


I'm writing a page scraper using Beautiful Soup, and have noticed it will sometimes try to parse a page even though it hasn't completely loaded.

What I'm doing is something like this:

soup = BeautifulSoup(urllib.urlopen(page))

I'm not very good with Python, but I think there must be a way for me to know that the page has finished loading, so I can start scraping it.

The reason I know it isn't waiting until everything has loaded is that the script works most of the time, but at other times it fails with an error saying the element I'm looking for isn't on the page (yet).

Could anyone give me a hand with this?


Solution

  • Try reading everything into a string:

    html = urllib.urlopen(page).read()
    soup = BeautifulSoup(html)
    

    While the Beautiful Soup docs say passing an open file object is fine, it is worth trying it this way. If it still fails, the problem is not related to Beautiful Soup at all. In that case, print html to see what you actually received. The cause may be, for example, that you are not logged in to the site when accessing it from your Python script.
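
    The debugging step above could be sketched roughly as follows. This is only an illustration and rests on some assumptions: it uses Python 3's urllib.request rather than the urllib.urlopen call from the question, and the helper names fetch_html and contains_marker, the 10-second timeout, and the UTF-8 fallback are all made up for the example.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download the whole response body and return it as text.

    Hypothetical helper: the name, the timeout default, and the UTF-8
    fallback are assumptions made for this sketch.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # read() with no argument reads until the end of the response body.
        raw = response.read()
        # Use the charset declared in the Content-Type header, if there is one.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def contains_marker(html, marker):
    """Return True if the text we expect to scrape is present in the HTML."""
    return marker in html
```

    If contains_marker(html, ...) returns False for the element you are after, printing html should show what the server actually sent (a login page, an error page, and so on). If it returns True, html can be passed to BeautifulSoup as in the snippet above.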