Python re.findall returns only first match

I'm somewhat stuck with this and didn't find a similar issue here.

I want to get a list of all the tag names in a string, e.g. <a> -> a and </b> -> b.

import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'<\s*(\w+)/?\s*.*>'
tags = re.findall(pat, s)
print(tags)

Here I get ['p'] as a result. If I change the \w+ to [a-d]+ I just get ['a'] as a result.

I'd expect as result ['p', 'a', 'a', 'p'] or at least all the distinct tag values.

What did I do wrong here? Thank you!

Using Python 3.x

Solution

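Firstly, you need to make your pattern non-greedy (switch .* to .*?). A greedy .* consumes as much as it can, so after matching <p it runs on to the last > in the string and the whole input becomes a single match. You can read more about that in the examples given in the Python docs (they even use HTML tags as an example!).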

Secondly, the /? part should come right after the opening <, rather than after the tag name \w+, so that closing tags such as </a> match too.

Also, the second \s* is redundant, since .* matches whitespace as well.

import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'</?\s*(\w+).*?>'
tags = re.findall(pat, s)
print(tags)

Output:

['p', 'a', 'a', 'p']
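To see why the original pattern returned a single match, here is a small side-by-side sketch of the greedy and non-greedy patterns on the same string. The variable names greedy, lazy and distinct are only illustrative, and the dict.fromkeys step is one possible way to get the distinct tag names mentioned in the question:

```python
import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'

# Greedy: after capturing 'p', '.*' runs to the end of the string and
# backtracks only as far as the last '>', so one match swallows everything.
greedy = re.findall(r'<\s*(\w+)/?\s*.*>', s)

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed,
# and '/?' right after '<' lets closing tags match as well.
lazy = re.findall(r'</?\s*(\w+).*?>', s)

# Order-preserving de-duplication, if only distinct tag names are wanted.
distinct = list(dict.fromkeys(lazy))

print(greedy)    # ['p']
print(lazy)      # ['p', 'a', 'a', 'p']
print(distinct)  # ['p', 'a']
```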

For a much more general solution, consider using BeautifulSoup or HTMLParser instead:

from html.parser import HTMLParser

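class HTMLTagParser(HTMLParser):
    # Both handlers append to the module-level `tags` list defined below.

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag, e.g. <a href="...">
        tags.append(tag)

    def handle_endtag(self, tag):
        # Called for each closing tag, e.g. </a>
        tags.append(tag)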

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
tags = []
parser = HTMLTagParser()
parser.feed(s)
print(tags)

Output:

['p', 'a', 'a', 'p']

This approach will work for arbitrary HTML, whereas a regex gets messier the fewer assumptions you make about the input. Note that for start tags, the attrs argument in handle_starttag can also be used to retrieve the tag's attributes, should you need them.
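As a rough sketch of the attrs remark, the variant below keeps its results on the parser instance instead of a global list and records each start tag's attributes as a dict. The class name TagCollector and the attribute names tags and attributes are made up for this example; only HTMLParser, handle_starttag, handle_endtag, feed and close come from the standard library:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):  # hypothetical name, not part of the stdlib
    def __init__(self):
        super().__init__()
        self.tags = []        # tag names in document order
        self.attributes = []  # (tag, {attribute: value}) for each start tag

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        # attrs arrives as a list of (name, value) pairs
        self.attributes.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tags.append(tag)


s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
collector = TagCollector()
collector.feed(s)
collector.close()

print(collector.tags)
# ['p', 'a', 'a', 'p']
print(collector.attributes)
# [('p', {}), ('a', {'href': 'http://www.quackit.com/html/tutorial/html_links.cfm'})]
```

Keeping the state on the instance means each parser can be used independently, without relying on a shared global list.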