python, regex, python-re, python-webbrowser

re.findall picking up only second digit of a two digit number in a web page


I am trying to parse an HTML page using regular expressions. I have to find the sum of all comments on this web page: https://py4e-data.dr-chuck.net/comments_42.html Everything else is working fine, but the re.findall function is only picking up the second digit of each two-digit number. I am not able to figure out why this is happening.

This is my code:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
code = list()
html = urllib.request.urlopen("https://py4e-data.dr-chuck.net/comments_42.html", context=ctx)
for line in html:
    line = line.decode()
    line = line.strip()
    numbers = re.findall("<span.+([0-9]+)", line)
    if len(numbers) != 1: continue
    print(numbers)

This is my output: (I am getting 7 instead of 97, 0 instead of 90)


Solution

  • Regexes are greedy by default (not just in Python, in basically every regex system I'm aware of), so they try to take as many characters as possible for each variable-length match (e.g. * and +) in the regex, from left to right, so long as the rest of the pattern can still match what remains. As such, the .+ in <span.+([0-9]+) matches everything up to the last digit on the line, leaving only that one digit (which must be left over for [0-9]+ to match), so [0-9]+ can never capture more than one digit.

    You can solve this in various ways:

    1. If the characters between span and the desired digits will never be digits themselves, only match non-digits instead of ., e.g. r"<span[^0-9]+([0-9]+)" (note: I used an r prefix to make that a raw string, which you should always do with Python regex literals to avoid issues with string escapes overlapping regex escapes; it would allow you to safely use \D and \d in place of [^0-9] and [0-9] respectively if you liked, and weren't concerned with non-ASCII digits). The regex is still greedy, and should perform equally well, but it will stop at the first run of digits and capture all of them, rather than capturing only the final digit of the last run of digits.

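    2. Alternatively, make the .+ non-greedy by changing the regex to r"<span.+?([0-9]+)". The ? after the + means "match the fewest characters possible", rather than the default greedy "match as many as possible", so .+? stops just before the first digit it can, and [0-9]+ then captures that whole run. Note that, like option 1, this captures the first run of digits after <span, so it will not help if digits can appear earlier on the line (for example inside an attribute value). It will typically make the regex run a little slower, but not enough to matter in most cases.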
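
To see the difference between the three patterns, here is a minimal sketch that runs them against sample lines instead of the live page. The sample rows are an assumption, modeled on what the page in the question appears to contain, and sum_comments is a hypothetical helper name used only for illustration.

```python
import re

# Hypothetical sample rows, modeled on the page in the question.
lines = [
    '<tr><td>Romina</td><td><span class="comments">97</span></td></tr>',
    '<tr><td>Laurie</td><td><span class="comments">90</span></td></tr>',
]

# Original pattern: greedy .+ swallows everything up to the last digit.
greedy = re.findall(r"<span.+([0-9]+)", lines[0])

# Option 1: only allow non-digits between <span and the number.
non_digits = re.findall(r"<span[^0-9]+([0-9]+)", lines[0])

# Option 2: non-greedy .+? stops at the first digit it can.
lazy = re.findall(r"<span.+?([0-9]+)", lines[0])

print(greedy, non_digits, lazy)


def sum_comments(rows):
    """Sum the comment counts found in an iterable of HTML lines."""
    pattern = re.compile(r"<span[^0-9]+([0-9]+)")
    total = 0
    for row in rows:
        for match in pattern.findall(row):
            total += int(match)
    return total


print(sum_comments(lines))
```

The same helper could be fed the decoded lines from urllib.request.urlopen in the question's loop; the network call is left out here so the sketch runs offline.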