Is there a way to parse HTML tags with Python?

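I need to split an HTML snippet into a flat list of its tags, where an element that contains only text stays together with its text as a single item.

For example, I need to turn: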

<div class="title">
     <h1> Hello World </h1>
</div>

into

['<div class="title">', '<h1> Hello World </h1>', '</div>']

Solution

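You can use a recursive generator function with BeautifulSoup. An element that has child tags is yielded as a separate opening tag, followed by the output for each child, followed by its closing tag. An element with no child tags is yielded as one string together with its text: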

import bs4
from bs4 import BeautifulSoup as soup

s = """
<div class="title">
   <h1> Hello World </h1>
</div>
"""

def get_tags(d):
    # Rebuild the opening tag; multi-valued attributes such as class are lists
    ats = " ".join(f'{a}="{" ".join(b) if isinstance(b, list) else b}"' for a, b in d.attrs.items())
    h = f'<{d.name} {ats}>' if ats else f'<{d.name}>'
    # Keep child tags only; text nodes that sit between them are skipped
    if (k := [i for i in d.contents if isinstance(i, bs4.element.Tag)]):
        yield h
        for l in k:
            yield from get_tags(l)
        yield f'</{d.name}>'
    else:
        # Leaf element: keep the tag and its text together
        yield f'{h}{d.text}</{d.name}>'

# find(True) returns the first tag, so the leading newline text node is skipped
print(list(get_tags(soup(s, 'html.parser').find(True))))

Output:

['<div class="title">', '<h1> Hello World </h1>', '</div>']
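If installing BeautifulSoup is not an option, roughly the same list can be built with the standard library's html.parser module. This is only a sketch of a different technique (an event-based parser instead of a recursive walk over a tree), not the method used above. The names TagLister and list_tags are made up for illustration. It assumes every element is explicitly closed, so an unclosed void tag such as a bare <br> would be merged incorrectly, and it keeps each opening tag exactly as written in the source rather than rebuilding it from the attributes.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Hypothetical helper: collect tags into a flat list of strings."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished output strings
        self.leaf_text = None  # text of the newest open element while it has no child tags

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the opening tag exactly as it appears in the source
        self.items.append(self.get_starttag_text())
        # A new element counts as a leaf until a child tag shows up;
        # resetting here also discards the parent's buffered text
        self.leaf_text = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted as-is
        self.items.append(self.get_starttag_text())
        self.leaf_text = None

    def handle_data(self, data):
        # Text is kept only while we are directly inside a leaf element
        if self.leaf_text is not None:
            self.leaf_text.append(data)

    def handle_endtag(self, tag):
        if self.leaf_text is not None:
            # No child tags were seen: merge opening tag, text and closing tag
            self.items[-1] += "".join(self.leaf_text) + f"</{tag}>"
        else:
            self.items.append(f"</{tag}>")
        self.leaf_text = None


def list_tags(markup):
    parser = TagLister()
    parser.feed(markup)
    parser.close()
    return parser.items


sample = """
<div class="title">
     <h1> Hello World </h1>
</div>
"""
print(list_tags(sample))
# ['<div class="title">', '<h1> Hello World </h1>', '</div>']
```

As in the BeautifulSoup version, text that sits directly inside an element which also has child tags is dropped.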