python regex beautifulsoup robobrowser

Returning multiple matches with RoboBrowser/BeautifulSoup


I'm trying to get multiple regex matches with the find/find_all methods, but can't get it to work.

A piece of the HTML-code can be something like:

<b>Week</b> 22: 3871983

Then in code I'm trying the following:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(parser='html.parser')
browser.open(some_url_containing_the_above_html_code)
result = browser.find_all(text=re.compile(r'Week\s+(\d+).*?(\d+)'))

print(result)

Which outputs something like:

['Week 22:\xa3871983']

I expected something like:

['22', '3871983']

Does the \xa ruin it? Or is it not possible to return multiple matches within a single regex? I don't really know how to solve it. I could always store the return value in a string and parse it once more with a split or regex, but I'd rather get it directly with find or find_all.


Solution

  • This is a misunderstanding about what find_all does. All it does is return a list of the elements that match the given condition, which in your case is a regex. Your regex has capture groups, but that is not relevant here: find_all does not split its results by the regex. So

    ['Week 22:\xa3871983']
    

    is the expected result. If you want this converted into ['22', '3871983'], post-process each returned string yourself, for example:

     import re
     for text in result:
         # split on every run of non-digit characters and drop empty pieces
         parts = [p for p in re.split(r"\D+", text) if p]
         print(parts)
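
Another way to do the post-processing is to reuse the pattern from the question, since its capture groups already pick out the two numbers. This is only a minimal sketch: it assumes the strings returned by find_all look like the one in the question, with a non-breaking space (\xa0) in front of the second number. extract_numbers and the sample list are hypothetical names made up for illustration, not part of RoboBrowser or BeautifulSoup.

```python
import re

# Same pattern as in the question, written as a raw string.
WEEK_PATTERN = re.compile(r'Week\s+(\d+).*?(\d+)')


def extract_numbers(strings):
    """Return one [week, value] list per string that matches the pattern."""
    extracted = []
    for text in strings:
        match = WEEK_PATTERN.search(text)
        if match:  # skip strings the pattern does not match
            extracted.append(list(match.groups()))
    return extracted


# Hypothetical stand-in for what find_all returned; the \xa0 is an assumption.
sample = ['Week 22:\xa03871983']
print(extract_numbers(sample))
```

With that sample this should print [['22', '3871983']], one inner list per matched string. Strings that do not match are skipped, so a stray text node would not raise an error.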