I'm somewhat stuck with this and didn't find a similar issue here.
I want to get a list of all the tag names in a string, e.g. <a> -> a or </b> -> b.
import re
s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'<\s*(\w+)/?\s*.*>'
tags = re.findall(pat, s)
print(tags)
Here I get ['p'] as a result. If I change the \w+ to [a-d]+, I just get ['a'] as a result.
I'd expect ['p', 'a', 'a', 'p'] as the result, or at least all the distinct tag values.
What did I do wrong here? Thank you!
Using Python 3.x
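Firstly, you need to make your pattern non-greedy (switch .* to .*?). As written, the greedy .* runs all the way to the last > in the string, so the first match swallows everything and findall returns a single result. You can read more about that in the examples given in the Python docs (they even use HTML tags as an example!).
Secondly, the /? part should come right after the opening <, before the tag name \w+, because the slash of a closing tag such as </a> precedes the name.
Also, the second \s* is redundant, since .* matches whitespace as well.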
import re
s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'</?\s*(\w+).*?>'
tags = re.findall(pat, s)
print(tags)
Output:
['p', 'a', 'a', 'p']
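To make the effect of the greedy quantifier visible, here is a small side-by-side sketch using the same string. The variable names (greedy, lazy, distinct) are only illustrative, and the last step is one possible way to get the distinct tag names the question mentions:

```python
import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'

# Greedy: after capturing "p", .* runs to the end of the string and backtracks
# only as far as the final ">", so one match consumes the whole string.
greedy = re.findall(r'</?\s*(\w+).*>', s)

# Non-greedy: .*? stops at the first ">" that follows each tag name.
lazy = re.findall(r'</?\s*(\w+).*?>', s)

# Order-preserving de-duplication (dicts keep insertion order in Python 3.7+).
distinct = list(dict.fromkeys(lazy))

print(greedy)    # ['p']
print(lazy)      # ['p', 'a', 'a', 'p']
print(distinct)  # ['p', 'a']
```

If the order of the distinct names does not matter, set(lazy) would do as well.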
For a much more general solution, consider using BeautifulSoup or HTMLParser instead:
from html.parser import HTMLParser

class HTMLTagParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tags.append(tag)

    def handle_endtag(self, tag):
        tags.append(tag)

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
tags = []
parser = HTMLTagParser()
parser.feed(s)
print(tags)
Output:
['p', 'a', 'a', 'p']
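One case where the regex and the parser can give different answers is markup inside an HTML comment. The sketch below uses a made-up sample string and a hypothetical TagCollector class that keeps its results on the instance instead of in a global list:

```python
import re
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Hypothetical helper: collects start and end tag names in order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.found.append(tag)

    def handle_endtag(self, tag):
        self.found.append(tag)

# Made-up input: the <b> element only exists inside a comment.
sample = '<p>text</p><!-- <b>old</b> -->'

# The regex has no notion of comments, so it also reports the commented-out tags.
regex_tags = re.findall(r'</?\s*(\w+).*?>', sample)

# The parser treats the comment as a comment and skips its contents.
collector = TagCollector()
collector.feed(sample)
collector.close()

print(regex_tags)       # ['p', 'p', 'b', 'b']
print(collector.found)  # ['p', 'p']
```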
This approach will work on arbitrary HTML, whereas a regex becomes messy as you relax the assumptions about the input. Note that for start tags, the attrs argument of handle_starttag can also be used to retrieve the attributes of the tag, should you need them.
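As a rough sketch of the attrs usage mentioned above: HTMLParser passes attrs as a list of (name, value) pairs, so it can be turned into a dict per tag. The class name TagAttrParser and the start_tags attribute are made up for this example:

```python
from html.parser import HTMLParser

class TagAttrParser(HTMLParser):
    """Hypothetical helper: records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [('href', 'http://...')]
        self.start_tags.append((tag, dict(attrs)))

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
attr_parser = TagAttrParser()
attr_parser.feed(s)
print(attr_parser.start_tags)
# [('p', {}), ('a', {'href': 'http://www.quackit.com/html/tutorial/html_links.cfm'})]
```

Converting to a dict keeps only the last value if an attribute is repeated on the same tag; keep the raw list if that matters.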