
Python re.findall returns only first match


I'm somewhat stuck with this and didn't find a similar issue here.

I want to get a list of all the tag names in the string, e.g. <a> -> a or </b> -> b.

import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'<\s*(\w+)/?\s*.*>'
tags = re.findall(pat, s)
print(tags)

Here I get ['p'] as a result. If I change the \w+ to [a-d]+ I just get ['a'] as a result.

I'd expect as result ['p', 'a', 'a', 'p'] or at least all the distinct tag values.

What did I do wrong here? Thank you!

Using Python 3.x


Solution

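  • Firstly, you need to make the pattern non-greedy (switch .* to .*?). A greedy .* runs on to the last > in the string, so the whole input is consumed by a single match, which is why only ['p'] comes back. You can read more about that in the examples given in the Python docs (they even use HTML tags as an example!).

    Secondly, the /? part should come right after the <, before the tag name \w+, so that closing tags such as </a> match too.

    Also, the second \s* is redundant, since .*? matches whitespace as well (apart from newlines, unless re.DOTALL is set).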

    import re
    
    s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
    pat = r'</?\s*(\w+).*?>'
    tags = re.findall(pat, s)
    print(tags)
    

    Output:

    ['p', 'a', 'a', 'p']
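
To see why the original pattern yields a single match, it can help to look at the full text each pattern matches rather than just the captured group. The following is a minimal sketch; the names greedy and lazy are chosen only for this illustration.

```python
import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'

# Greedy: .* runs to the last '>' in the string, so one match swallows everything.
greedy = [m.group(0) for m in re.finditer(r'<\s*(\w+)/?\s*.*>', s)]

# Non-greedy: .*? stops at the first '>' after the tag name, giving one match per tag.
lazy = [m.group(0) for m in re.finditer(r'</?\s*(\w+).*?>', s)]

print(len(greedy), greedy[0] == s)  # 1 True
print(len(lazy))                    # 4
print(lazy[0], lazy[-1])            # <p> </p>
```

Because re.findall never returns overlapping matches, the tags inside that one greedy match are never examined on their own.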
    

    For a much more general solution, consider using BeautifulSoup or HTMLParser instead:

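    from html.parser import HTMLParser
    
    class HTMLTagParser(HTMLParser):
        """Collect the names of all start and end tags, in document order."""
    
        def __init__(self):
            super().__init__()
            self.tags = []  # kept on the instance instead of in a global
    
        def handle_starttag(self, tag, attrs):
            self.tags.append(tag)
    
        def handle_endtag(self, tag):
            self.tags.append(tag)
    
    s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
    parser = HTMLTagParser()
    parser.feed(s)
    print(parser.tags)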
    

    Output:

    ['p', 'a', 'a', 'p']
    

    This approach works for arbitrary HTML, whereas a regex becomes messy as you drop assumptions about the input. Note that for start tags, the attrs argument of handle_starttag also gives you the attributes of the tag, should you need them.
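
As a rough illustration of the attrs argument, the sketch below records each start tag together with its attributes. The class name AttrCollector and its start_tags list are hypothetical names made up for this example; HTMLParser passes attrs to handle_starttag as a list of (name, value) pairs.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attributes) for every start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes lookups easier.
        self.start_tags.append((tag, dict(attrs)))

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
collector = AttrCollector()
collector.feed(s)

for tag, attributes in collector.start_tags:
    print(tag, attributes)
```

Here the p tag comes back with an empty dict and the a tag with its href value, so something like attributes.get('href') can be used to pull out link targets.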