Tags: python, html-parsing

Is there a way to parse HTML tags with Python?


I need to split a piece of HTML into a list of its tags.

For example, I need to turn:

<div class="title">
     <h1> Hello World </h1>
</div>

into

['<div class="title">', '<h1> Hello World </h1>', '</div>']

Solution

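  • You can use a recursive generator function with BeautifulSoup. An element that contains child tags yields its opening and closing tags separately, while an element without child tags is yielded whole: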

    import bs4
    from bs4 import BeautifulSoup as soup
    s = """
    <div class="title">
       <h1> Hello World </h1>
    </div>
    """
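    def get_tags(d):
        # Rebuild the opening tag; multi-valued attributes such as class are lists in bs4
        ats = " ".join(f'{a}="{" ".join(b) if isinstance(b, list) else b}"' for a, b in d.attrs.items())
        h = f'<{d.name} {ats}>' if ats else f'<{d.name}>'
        # Keep only child tags, skipping text nodes such as whitespace between tags
        children = [i for i in d.contents if isinstance(i, bs4.element.Tag)]
        if children:
            # Element with child tags: yield its opening and closing tags separately
            yield h
            for child in children:
                yield from get_tags(child)
            yield f'</{d.name}>'
        else:
            # Element without child tags: yield it whole, text included
            yield f'{h}{d.text}</{d.name}>'

    # s starts with a newline, so select the div by name rather than by position
    print(list(get_tags(soup(s, 'html.parser').div)))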
    

    Output:

    ['<div class="title">', '<h1> Hello World </h1>', '</div>']
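  • If installing BeautifulSoup is not an option, roughly the same list can probably be built with the standard library's html.parser module. The following is only a minimal sketch, not part of the original answer: the names TagCollector and split_tags are made up for illustration, and it assumes well-formed markup in which every tag is explicitly closed (an unclosed void tag such as <br> would confuse the stack). Like the BeautifulSoup version, it drops text that sits directly inside an element that also has child tags.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: collects tags as strings, keeping leaf elements whole."""

    def __init__(self):
        super().__init__()
        self.parts = []   # finished output strings, in document order
        self.stack = []   # one [opening_tag, text, emitted] entry per open element

    def handle_starttag(self, tag, attrs):
        # A child tag is starting, so the parent is not a leaf:
        # emit the parent's opening tag on its own (once).
        if self.stack and not self.stack[-1][2]:
            self.parts.append(self.stack[-1][0])
            self.stack[-1][2] = True
        # Re-render attributes; valueless attributes arrive with a value of None.
        rendered = "".join(
            f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs
        )
        self.stack.append([f"<{tag}{rendered}>", "", False])

    def handle_data(self, data):
        # Buffer text for the innermost open element; it is only used for leaves.
        if self.stack:
            self.stack[-1][1] += data

    def handle_endtag(self, tag):
        if not self.stack:
            return
        opening, text, emitted = self.stack.pop()
        if emitted:
            # The element had child tags: only the closing tag is left to emit.
            self.parts.append(f"</{tag}>")
        else:
            # Leaf element: emit opening tag, text, and closing tag as one string.
            self.parts.append(f"{opening}{text}</{tag}>")


def split_tags(html):
    """Return the tags of an HTML fragment as a list of strings."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.parts


s = """
<div class="title">
   <h1> Hello World </h1>
</div>
"""
print(split_tags(s))
```

    For the sample input this prints ['<div class="title">', '<h1> Hello World </h1>', '</div>']. Note that html.parser lowercases tag and attribute names, so the output is a re-rendering of the markup rather than an exact copy of the source text.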