I need to split a piece of HTML into a list of its tags.
For example, I need to turn:
<div class="title">
<h1> Hello World </h1>
</div>
into
['<div class="title">', '<h1> Hello World </h1>', '</div>']
You can use a recursive generator function with BeautifulSoup:
import bs4
from bs4 import BeautifulSoup as soup
s = """
<div class="title">
<h1> Hello World </h1>
</div>
"""
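def get_tags(d):
    # Rebuild the attribute string; multi-valued attributes such as class come back as lists.
    ats = " ".join(a+"="+f'"{(b if not isinstance(b, list) else " ".join(b))}"' for a, b in d.attrs.items())
    h = f'<{d.name} {ats}>' if ats else f'<{d.name}>'
    if (k:=[i for i in d.contents if isinstance(i, bs4.element.Tag)]):
        # The tag has child tags: emit the opening tag, recurse into each child, then close.
        yield h
        yield from [j for l in k for j in get_tags(l)]
        yield f'</{d.name}>'
    else:
        # No child tags: emit the whole element, text included, as a single string.
        yield f'{h}{d.text}</{d.name}>'

# contents[0] is the leading newline in s, so the <div> is contents[1].
print(list(get_tags(soup(s, 'html.parser').contents[1])))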
Output:
['<div class="title">', '<h1> Hello World </h1>', '</div>']
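If installing BeautifulSoup is not an option, roughly the same result can be sketched with the standard library's html.parser. This is a different technique from the recursive generator above (event callbacks instead of a parsed tree), and the names TagLister and list_tags are made up for this illustration. It assumes well-formed input with no void tags such as <br> and no text mixed in between child tags.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Hypothetical helper: collects tags as strings, merging leaf elements into one entry."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._leaf = None  # name of the last opened tag while it has no child tags
        self._text = ""    # text seen since that tag was opened

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written in the source.
        self.tags.append(self.get_starttag_text())
        self._leaf = tag
        self._text = ""

    def handle_data(self, data):
        # Keep text only while we are directly inside a candidate leaf element.
        if self._leaf is not None:
            self._text += data

    def handle_endtag(self, tag):
        if self._leaf == tag:
            # No child tag was opened in between: fold text and closing tag into one string.
            self.tags[-1] += f"{self._text}</{tag}>"
        else:
            self.tags.append(f"</{tag}>")
        self._leaf = None
        self._text = ""


def list_tags(html):
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.tags


s = """
<div class="title">
<h1> Hello World </h1>
</div>
"""
print(list_tags(s))
```

Because the start tag is copied from the source rather than rebuilt from its attributes, attribute order and quoting are preserved as written.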