Tags: python, encoding, python-re, urlopen, python-3.9

Apply regexp to urlopen request


I'm trying to apply a regexp filter to the page returned by urlopen(req):

from urllib.request import urlopen, Request
import re
from contextlib import closing

req = Request('https://yts-subs.com/movie-imdb/tt1483013')
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
webpage = urlopen(req)
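# get_content_charset() takes a fallback value, not a parameter name,
# so fall back to UTF-8 when the server declares no charset
encoding = webpage.headers.get_content_charset('utf-8')

# page = str(webpage.read(), encoding)
page = webpage.read().decode(encoding)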

pattern = re.compile(r'<tr data-id=".*?"(?: class="((?:high|low)-rating)")?>\s*<td class="rating-cell">\s*.*</span>\n\s*</td>\n\s*<td class.*\n\s*<span.*>.*</span>\n\s*<span class="sub-lang">(.*)?</span>\n\s*</td>\n\s*<td>\n\s*<a href="(.*)?">'
                     ,re.UNICODE)
print(pattern.findall(page))

But for some reason, it doesn't match anything. The pattern should be fine (I tested it on its own), and the page is read correctly. Suspecting an encoding error, I tried converting the bytes with str() and with decode(), without much success. What puzzles me is that if I write the page to an intermediate file and read it back, it works...

Adding this just before the pattern makes it work:

with open('temp.data', 'w') as data:
    data.write(page)
with open('temp.data', 'r') as data:
    page = data.read()

Obviously I'm doing something wrong; I would appreciate a hint!
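
One plausible explanation for why the file round trip helps, assuming the server sends CRLF (\r\n) line endings (the question does not confirm this): reading a file in text mode with the default newline handling translates \r\n to \n, so the string that comes back is not identical to the one that was written. The sketch below uses a short hypothetical stand-in for the downloaded page and a temporary file instead of temp.data.

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded page, assuming CRLF line endings.
page = '<td>\r\n  <a href="/subtitles/example">\r\n'


def round_trip(text):
    """Write text to a temporary file and read it back in text mode."""
    fd, path = tempfile.mkstemp(suffix='.data')
    os.close(fd)
    try:
        with open(path, 'w') as data:
            data.write(text)
        # Default text mode (newline=None) translates '\r\n' and '\r' to '\n'.
        with open(path, 'r') as data:
            return data.read()
    finally:
        os.remove(path)


restored = round_trip(page)
print('\r' in page)      # carriage returns present before the round trip
print('\r' in restored)  # carriage returns gone after the round trip
```

If that assumption holds, a pattern that expects a bare \n right after a closing tag can only match the text that came back from the file.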


Solution

  • OK, it turned out that my regexp pattern was the problem. Rewriting it more precisely made it work. Here is the working pattern:

    pattern = re.compile(r'<tr data-id=".*?"(?: class="((?:high|low)-rating)")?>\s*<td class="rating-cell">\s*.*</span>\s*</td>\s*<td class.*\s*<span.*>.*</span>\s*<span class="sub-lang">(.*)?</span>\s*</td>\s*<td>\s*<a href="([^">]*)?')
    

    and the failing one, for comparison:

    pattern = re.compile(r'<tr data-id=".*?"(?: class="((?:high|low)-rating)")?>\s*<td class="rating-cell">\s*.*</span>\n\s*</td>\n\s*<td class.*\n\s*<span.*>.*</span>\n\s*<span class="sub-lang">(.*)?</span>\n\s*</td>\n\s*<td>\n\s*<a href="(.*)?">'
                         ,re.UNICODE)
    

    Thanks for your help; I will look into the alternatives suggested in the other answers!
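
The main difference between the two patterns is that every literal \n became \s*, and the final (.*)? group became [^">]*. A minimal sketch of why that can matter, assuming the page uses CRLF line endings (an assumption, not something stated in the post): the sample string and the shortened patterns below are hypothetical and only mimic the last part of the real ones.

```python
import re

# Hypothetical fragment with CRLF line endings, standing in for the real page.
sample = '<td>\r\n    <a href="/subtitles/example">link</a>\r\n</td>'

# Shortened version of the failing pattern: requires '\n' right after <td>.
strict = re.compile(r'<td>\n\s*<a href="(.*)?">')
# Shortened version of the working pattern: any whitespace, bounded href.
tolerant = re.compile(r'<td>\s*<a href="([^">]*)')

strict_hits = strict.findall(sample)
tolerant_hits = tolerant.findall(sample)
# Normalizing the line endings first lets the strict pattern match as well.
normalized_hits = strict.findall(sample.replace('\r\n', '\n'))

print(strict_hits)      # []
print(tolerant_hits)    # ['/subtitles/example']
print(normalized_hits)  # ['/subtitles/example']
```

Because \s matches \r as well as \n, the rewritten pattern does not depend on which line endings the server sends, and [^">]* stops at the closing quote instead of relying on backtracking.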