python regex html-parsing urllib

regex and urllib.request to scrape links from HTML


I am trying to parse an HTML page and extract every value that matches this regex: href="http://.+?"

This is the code:

import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"',html)
for link in links:
    print(link)

But I am getting this error: TypeError: cannot use a string pattern on a bytes-like object


Solution

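  • urlopen(url) returns a response object, and its read() method returns bytes, so your html variable holds bytes rather than a string. A str pattern cannot be matched against a bytes object, which is what the TypeError is telling you. You can decode the bytes first, using something like this: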

    htmlobject = urllib.request.urlopen(url)
    html = htmlobject.read().decode('utf-8')
    

    After that, html is a string and you can use it in your regex.
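For reference, here is a minimal sketch of the whole script with the decode step in place. The helper names extract_links and fetch_links are made up for illustration, and falling back to UTF-8 with errors="replace" is an assumption, since a page may declare a different encoding or none at all.

```python
import re
import urllib.request

# Same pattern as in the question, written as a raw string and compiled once.
LINK_PATTERN = re.compile(r'href="(http://.*?)"')


def extract_links(raw, encoding="utf-8"):
    """Decode raw bytes to str, then return every captured http:// link."""
    # errors="replace" is an assumption: it keeps stray bytes from raising
    # UnicodeDecodeError, at the cost of replacing them with U+FFFD.
    html = raw.decode(encoding, errors="replace")
    return LINK_PATTERN.findall(html)


def fetch_links(url):
    """Hypothetical helper: download a page and extract its links."""
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header if the server sent one,
        # otherwise assume UTF-8.
        encoding = response.headers.get_content_charset() or "utf-8"
        return extract_links(response.read(), encoding)


# Quick check on in-memory bytes, no network needed.
sample = b'<a href="http://example.com/a">A</a> <a href="https://example.com/b">B</a>'
print(extract_links(sample))  # → ['http://example.com/a']
```

To reproduce the original script, loop over fetch_links(input('Enter - ')) and print each link. Note that the pattern only captures http:// links, so the https:// one in the sample is skipped. If you would rather not decode at all, a bytes pattern such as rb'href="(http://.*?)"' also works on the raw data, but the matches come back as bytes.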