I am trying to parse an HTML page to extract all values matching this regex: href="http://.+?"
This is the code:
import urllib.request
import re
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"',html)
for link in links:
    print(link)
But I am getting this error: TypeError: cannot use a string pattern on a bytes-like object
urlopen(url).read() returns a bytes object, so your html variable contains bytes as well, and re cannot apply a string pattern to it. You can decode it using something like this:
htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')
Then you can use html, which is now a string, in your regex.
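Putting it together, here is a minimal sketch of the whole script with the decode step in place. The helper names extract_links and fetch_links are made up for illustration, and two choices are assumptions rather than anything from the question: the encoding is taken from the response's Content-Type header with a fall back to UTF-8, and errors="replace" is used so that a page with a few invalid bytes does not raise UnicodeDecodeError.

```python
import re
import urllib.request

# Same pattern as in the question, compiled once as a str pattern.
LINK_PATTERN = re.compile(r'href="(http://.*?)"')


def extract_links(raw, encoding="utf-8"):
    """Decode the raw page bytes, then return every captured href value."""
    # Assumption: replace undecodable bytes instead of raising an error.
    html = raw.decode(encoding, errors="replace")
    return LINK_PATTERN.findall(html)


def fetch_links(url):
    """Download url and return the http:// links found in its HTML."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the server sends no
        # charset, so fall back to UTF-8 (an assumption, not a guarantee).
        encoding = response.headers.get_content_charset() or "utf-8"
        return extract_links(response.read(), encoding)
```

You would then loop over fetch_links(url) and print each link as in the original code. If you would rather not decode at all, a bytes pattern such as b'href="(http://.*?)"' also works on the raw data, but the matches come back as bytes instead of strings.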