I'm trying to parse nested HTML structures using Python's re module, and I can't get my regex pattern right.
I have HTML text that contains nested <div> elements, and I want to extract the content of the innermost ones. Here's the code I'm using:
import re
html_text = """
<div>
<div>
<div>
Innermost Content 1
</div>
</div>
<div>
Innermost Content 2
</div>
</div>
"""
pattern = r'<div>(.*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)
print(result)
I expected this code to return the content of the innermost elements, like this:
['Innermost Content 1', 'Innermost Content 2']
Instead, the first match also includes the nested opening <div> tags, and every match keeps its surrounding newlines. What am I doing wrong with my regex pattern, and how can I fix it to get the desired result? Any help would be greatly appreciated!
Try this modified code. The pattern replaces . with [^<], so a captured group can never contain another tag, and an extra line strips the surrounding \n and drops empty matches:
import re
html_text = """
<div>
<div>
<div>
Innermost Content 1
</div>
</div>
<div>
Innermost Content 2
</div>
</div>
"""
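# [^<]* cannot cross a "<", so only a <div> with no child tags can match.
pattern = r'<div>([^<]*)</div>'
# A negated character class already matches newlines, so re.DOTALL is not needed.
result = re.findall(pattern, html_text)
# Strip surrounding whitespace and drop matches that are empty after stripping.
result = [content.strip() for content in result if content.strip()]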
print(result)
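To see why the change works, here is a minimal side-by-side sketch of the two patterns. It assumes a compact one-line version of the sample HTML (same nesting, no newlines) so the captured strings are easy to read; the variable names lazy and innermost are only illustrative.

```python
import re

# Compact sample with the same nesting as html_text above (assumed for readability).
html_text = (
    "<div><div><div>Innermost Content 1</div></div>"
    "<div>Innermost Content 2</div></div>"
)

# A lazy .*? still starts at the first <div> it finds and only stops at the
# nearest </div>, so the first capture drags the nested opening tags along.
lazy = re.findall(r'<div>(.*?)</div>', html_text, re.DOTALL)

# [^<]* cannot cross a "<", so a match is only possible for a <div>
# whose content holds no other tag, i.e. an innermost one.
innermost = re.findall(r'<div>([^<]*)</div>', html_text)

print(lazy)       # ['<div><div>Innermost Content 1', 'Innermost Content 2']
print(innermost)  # ['Innermost Content 1', 'Innermost Content 2']
```

Note that this only holds for bare <div> tags whose innermost content is plain text. If the tags carry attributes, or the innermost content contains inline tags such as <b>, the pattern will miss them, and an HTML parser is likely the more robust choice.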