
python: get opening and closing html tags


Question:

How can I find the text of all opening and closing HTML tags with Python (3.6)? This needs to be the exact text, keeping spaces and potentially illegal HTML:

# input
html = """<p>This <a href="book"> book </a  > will help you</p attr="e">"""

# desired output
output = ['<p>', '<a href="book">', '</a  >', '</p attr="e">']

Attempt at solution:

Apparently this is not possible in BeautifulSoup. The question "How to get the opening and closing tag in beautiful soup from HTML string?" links to html.parser instead.

Implementing a custom parser is easy. You can use self.get_starttag_text() to get the text corresponding to the last opened tag. But for some reason, there is no analogous method get_endtag_text().

This means the best my parser can do for an end tag is fall back to get_starttag_text(), which produces the wrong output:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def reset_stored_tags(self):
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # There is no get_endtag_text(); get_starttag_text() only returns
        # the most recently opened tag, which is wrong for end tags.
        self.tags.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

# input
input_doc = """<p>This <a href="book"> book </a> will help you</p>"""

parser = MyHTMLParser()
parser.feed(input_doc)

print(parser.tags)
# ['<p>', '<a href="book">', '<a href="book">', '<a href="book">']

The tag argument of handle_endtag is just a string, "a" or "p", not some custom datatype that can provide the whole tag.


Solution

  • While the answer from @Ajax1234 contains some nice Python + BeautifulSoup, I found it to be very unstable, mostly because I need the exact string of the HTML tag: each tag found by the method must be present in the HTML text. This leads to the following problems:

    • It parses the tag names and attributes from the HTML and plugs them together to form the string of the tag: yield f'<{_d.name}>' if not _attrs else f'<{_d.name} {_attrs}>'. This gets rid of extra whitespace in the tag: <p > becomes <p>

    • It always generates a closing tag, even if there is none in the markup

    • It fails for attributes that are lists: <p class="a b"> becomes <p class="[a, b]">

    The whitespace problem can be partially solved by cleaning the HTML prior to processing it. I used bleach, but that can be too aggressive. Notably, you have to specify a list of accepted tags before you use it.

    A better approach is a thin wrapper around html.parser.HTMLParser. This is something I already started in my question; the difference here is that I automatically generate a closing tag.

    from html.parser import HTMLParser
    
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tags = []
    
        def handle_starttag(self, tag, attrs):
            self.tags.append(self.get_starttag_text())
    
        def handle_endtag(self, tag):
            self.tags.append(f"</{tag}>")
    
    parser = MyHTMLParser()
    parser.feed("""<p > Argh, whitespace and p is not closed </a>""")
    print(parser.tags)  # ['<p >', '</a>']
    

    This solves the problems mentioned above, but it has one shortcoming: it doesn't look at the actual text of the closing tag. If there are extra attributes or whitespace in the closing tag, the parser will not show them.
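
    One possible way around that shortcoming is to ask the parser where it currently is and slice the closing tag out of the original text. The following is only a sketch, not part of the answer above: the class name ExactTagParser and the helper _current_index are hypothetical, it assumes the whole document is passed in a single feed() call, and it relies on getpos() pointing at the start of the tag while handle_endtag runs, which is how CPython's html.parser behaves as far as I can tell. It also stops at the first ">", so a closing tag with ">" inside a quoted attribute value would be cut short.

```python
from html.parser import HTMLParser


class ExactTagParser(HTMLParser):
    """Collect the literal source text of every opening and closing tag."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._text = ""
        self._line_starts = [0]

    def feed(self, data):
        # Assumption: the whole document arrives in ONE feed() call, so the
        # (line, column) from getpos() can be mapped to an index into `data`.
        self._text = data
        self._line_starts = [0] + [i + 1 for i, ch in enumerate(data) if ch == "\n"]
        super().feed(data)

    def _current_index(self):
        # getpos() returns a 1-based line number and a 0-based column.
        lineno, offset = self.getpos()
        return self._line_starts[lineno - 1] + offset

    def handle_starttag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        start = self._current_index()
        # For <br/> the default handle_startendtag also calls handle_endtag,
        # but there is no closing tag in the source, so skip that case.
        if not self._text.startswith("</", start):
            return
        end = self._text.find(">", start)
        if end != -1:
            self.tags.append(self._text[start:end + 1])


parser = ExactTagParser()
parser.feed("""<p>This <a href="book"> book </a  > will help you</p attr="e">""")
print(parser.tags)
# ['<p>', '<a href="book">', '</a  >', '</p attr="e">']
```

    Unlike the wrapper above, this only reports closing tags that are really present in the markup, and it keeps their whitespace and any stray attributes exactly as written.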