python-3.x parsing urllib

How to parse content easily?


I am learning the urllib functions. The parsing code I have written is not selecting all of the information from the webpage.

I have changed the User-Agent header so the request looks like it comes from a real browser. Some of the information on the page is displayed, but mostly just the small print.

import urllib.request
import urllib.parse
import re

print('Webpage content surfer')

try:
    url = input('Enter full website address (http:// or https://): ')
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (x11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respdata = resp.read()


except Exception as e:
    print('That is not a valid website address\nCheck the web address'
          , (e))

content = re.findall(r'<p>(.*?)</p>', str(respdata))
for contents in content:
    print(contents)

I am not getting any errors, but the output does not include all the content on the page. Is this because I am requesting everything between paragraph tags using <p>(.*?)</p>?


Solution

  • I just tested your code against http://example.com, and it seems to display all content between <p> .. </p>. Is there a particular URL you are having issues with? I also suggest you use BeautifulSoup.
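
One possible cause, offered as a guess since the failing URL is not given: the pattern <p>(.*?)</p> only matches a bare <p> tag, so paragraphs written as <p class="..."> are skipped. Calling str() on the bytes from resp.read() also produces the b'...' representation with escaped newlines rather than the decoded text. The sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup (which is a third-party install), so it runs as-is. SAMPLE_HTML and ParagraphParser are hypothetical names made up for this illustration, and the UTF-8 encoding is an assumption.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, standing in for the bytes returned by resp.read().
SAMPLE_HTML = b"""<html><body>
<p>Plain paragraph.</p>
<p class="intro">Paragraph with
a line break and <b>bold</b> text.</p>
</body></html>"""


class ParagraphParser(HTMLParser):
    """Collect the text inside every <p> element, whatever its attributes."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an open one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def _flush(self):
        if self._in_p:
            # Join the collected pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
        self._chunks = []
        self._in_p = False


# Decode the bytes (UTF-8 assumed here) instead of calling str() on them.
html = SAMPLE_HTML.decode("utf-8")

# The original pattern: it only sees the paragraph with a bare <p> tag.
regex_hits = re.findall(r"<p>(.*?)</p>", html)

parser = ParagraphParser()
parser.feed(html)
parser.close()

print(regex_hits)
print(parser.paragraphs)
```

On this sample the regex finds only the first paragraph, while the parser returns both. With BeautifulSoup installed, the equivalent would be a loop over soup.find_all('p'), printing get_text() for each element.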