regex and urllib.request to scrape links from HTML

I am trying to parse an HTML page and extract every value that matches this regex: href="http://.+?"

This is the code:

import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"',html)
for link in links:
    print(link)

But I get this error: TypeError: cannot use a string pattern on a bytes-like object

Solution

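urlopen(url) returns a response object, and its read() method returns bytes, so your html variable holds bytes rather than a string. A string pattern cannot be matched against a bytes object, which is what the TypeError is telling you. You can decode the bytes first, using something like this: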

htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')

html is now a string, so you can use it in your regex.
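
Putting the pieces together, here is a minimal sketch of the whole script with the decode step applied. The names extract_links and fetch_links are hypothetical helpers introduced for illustration. Reading the charset from the response headers and using errors="replace" are assumptions added here; they are not part of the original question or answer.

```python
import re
import urllib.request

# Same pattern as in the question; the capture group holds the URL.
LINK_PATTERN = re.compile(r'href="(http://.*?)"')


def extract_links(raw_html, encoding="utf-8"):
    """Decode raw response bytes, then return every captured href value."""
    # Assumption: errors="replace" keeps a stray invalid byte from
    # raising UnicodeDecodeError on pages with a mislabeled encoding.
    text = raw_html.decode(encoding, errors="replace")
    return LINK_PATTERN.findall(text)


def fetch_links(url):
    """Download a page and return its http:// links (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # Assumption: prefer the charset from the Content-Type header,
        # falling back to UTF-8 when the server does not declare one.
        encoding = response.headers.get_content_charset() or "utf-8"
        return extract_links(response.read(), encoding)
```

Calling fetch_links(input('Enter - ')) and printing each item reproduces the loop from the question. Note that the pattern only matches http:// links, so https:// links are skipped unless the pattern is widened.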
