I am learning the urllib functions. The parsing code I have written is not selecting all of the information from the webpage.
I have changed the User-Agent header so the request looks like it comes from a real browser. Some of the information from the page is displayed, but mostly the small print.
import urllib.request
import urllib.parse
import re

print('Webpage content surfer')

try:
    url = input('Enter full website address (http://, https://):> ')
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respdata = resp.read()
except Exception as e:
    print('That is not a valid website address\nCheck the web address', e)

content = re.findall(r'<p>(.*?)</p>', str(respdata))
for contents in content:
    print(contents)
No errors are shown, but the output does not include all of the content on the page. Is this because I am capturing the information between the paragraph tags using (.*?)?
I just tested your code against http://example.com and it seems to display all of the content between <p> .. </p> tags.
Is there a particular URL you are having issues with?
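One thing worth checking: if you decode the response to text before matching, `.` in a regex does not match newline characters by default, so any <p> element whose content spans multiple lines is silently skipped unless you pass re.DOTALL. A minimal sketch (the HTML string here is made up for illustration):

```python
import re

# A paragraph broken across two lines, as real HTML often is
html = "<p>first line\nsecond line</p><p>short</p>"

# Without re.DOTALL, '.' stops at the newline, so the first <p> is missed
print(re.findall(r'<p>(.*?)</p>', html))

# With re.DOTALL, '.' also matches '\n', so both paragraphs are found
print(re.findall(r'<p>(.*?)</p>', html, re.DOTALL))
```

That would explain seeing only some paragraphs (e.g. short boilerplate lines) while longer, wrapped paragraphs go missing.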
I also suggest you use BeautifulSoup for HTML parsing instead of regular expressions.
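As a sketch of the BeautifulSoup approach (assumes the beautifulsoup4 package is installed; shown here on an inline HTML string rather than a live request, so it runs without a network connection):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration, including a paragraph split across lines
html = """
<html><body>
<p>First paragraph.</p>
<p>Second
paragraph, split across lines.</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all('p') returns every <p> element regardless of line breaks,
# nesting, or attributes -- cases a hand-written regex handles poorly
for p in soup.find_all('p'):
    print(p.get_text())
```

In your script you would pass respdata (decoded with resp.read().decode() or handed to BeautifulSoup directly as bytes) instead of the inline string.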