python-requests urllib2 urllib

How to port a Python urllib2 app (a web scraper) that uses Beautiful Soup 4 to use the requests package instead


I am trying to update a web scraper app that uses Beautiful Soup 4 in Python 3 (Anaconda) to use the Requests package instead of urllib, urllib2 and urllib3.

urllib and urllib2 don't exist in the Anaconda channels, and from what I have read, the requests package has made urllib and urllib2 obsolete. I am still rather new to Python programming for web scraping and don't yet fully understand all the concepts and internal subtleties of these four packages.

When I replace "urllib2.urlopen()" with "requests.get()", the following code raises an error:

import requests
from bs4 import BeautifulSoup

# Replace the following line with "page = requests.get(url)"
# page = urllib2.urlopen(url)
page = requests.get(url)
soup_page = BeautifulSoup(page, "lxml")

I get the following error message, with no explanation, from the bs4 module:

File "C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py", line 246, in __init__
    elif len(markup) <= 256 and (

TypeError: object of type 'Response' has no len()

This error message puts me deep into the bowels of __init__.py in bs4.

I cannot find an explanation of how to port urllib or urllib2 code to requests with Beautiful Soup 4.

Can anyone provide an explicit guide on how to port urllib / urllib2 apps to use requests with Beautiful Soup in Python 3?

Anaconda / conda does not import urllib or urllib2 into Python 3 environments.

Thank you.

Rich


Solution

  • The error occurs because you are passing the Response object itself to BeautifulSoup, which expects the HTML markup and calls len() on it. Pass the response's text attribute (page.text) instead of the response object:

    # page = urllib2.urlopen(url)
    
    page = requests.get(url)
    
    soup_page = BeautifulSoup(page.text, "lxml")
    

    You may need to read the requests documentation.
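
For porting more than one call site, the fix above could be wrapped in a small helper along these lines. This is only a sketch: the name fetch_soup, the 10-second timeout and the default "html.parser" backend are illustrative assumptions, not part of the original code. Pass "lxml" as the parser if it is installed, as in the question.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download url with requests and return a parsed BeautifulSoup tree.

    Hypothetical helper: the name and default arguments are assumptions
    made for illustration.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses rather than
    # silently parsing an error page.
    response.raise_for_status()
    # response.text is the decoded body as str; the Response object
    # itself has no len(), which is what triggered the TypeError.
    return BeautifulSoup(response.text, parser)
```

With this in place, a line such as soup_page = fetch_soup(url, "lxml") would stand in for the urllib2.urlopen(url) and BeautifulSoup(...) pair. If the decoded text looks wrong for a given site, passing response.content (raw bytes) to BeautifulSoup instead lets Beautiful Soup try to detect the encoding itself.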