
BeautifulSoup shuffles the attributes of html tags


I have an issue with BeautifulSoup. Whenever I parse an HTML input, it changes the order of the attributes (e.g. class, id) of the HTML tags.

For example:

from bs4 import BeautifulSoup

tags = BeautifulSoup('<span id="100" class="test"></span>', "html.parser")
print(str(tags))

Prints:

<span class="test" id="100"></span>

As you can see, the class and id order was changed. How can I prevent such behavior?

I am unfamiliar with web development, but I know that the order of the attributes doesn't matter.

My main goal is to preserve the original shape of the HTML input after parsing it, because I want to loop through the tags and match them, character by character, against other HTML texts.


Solution

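  • As you stated, the order of attributes in HTML doesn't matter. By default, BeautifulSoup's output formatter sorts attributes by name when it serializes a tag, which is why class ends up before id. If you really want unsorted attributes, you can override the formatter's attributes() method: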

    from bs4 import BeautifulSoup
    from bs4.formatter import HTMLFormatter
    
    
    class UnsortedAttributes(HTMLFormatter):
        def attributes(self, tag):
            yield from tag.attrs.items()
    
    
    tags = BeautifulSoup('<span id="100" class="test"></span>', "html.parser")
    
    print(tags.encode(formatter=UnsortedAttributes()).decode())
    

    Prints:

    <span id="100" class="test"></span>
    

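    EDIT: To keep void tags such as <input> from being written with a closing slash (<input/>), you can pass void_element_close_prefix="" to the formatter: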

    class UnsortedAttributes(HTMLFormatter):
        def __init__(self):
            # An empty prefix writes void tags as <input ...> instead of <input .../>.
            super().__init__(void_element_close_prefix="")
    
        def attributes(self, tag):
            yield from tag.attrs.items()
    
    
    tags = BeautifulSoup(
        """<input id="NOT_CLOSED_TAG" type="Button">""",
        "html.parser",
    )
    
    print(tags.encode(formatter=UnsortedAttributes()).decode())
    

    Prints:

    <input id="NOT_CLOSED_TAG" type="Button">
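
Since the goal is to match the serialized markup against the original text character by character, it may be worth checking which inputs actually survive a parse/serialize round trip. The sketch below reuses the formatter from above; `round_trips` is a hypothetical helper name, not part of BeautifulSoup, and it assumes the `html.parser` backend used in the examples.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class UnsortedAttributes(HTMLFormatter):
    """Formatter from the answer: keep attribute order, no '/' on void tags."""

    def __init__(self):
        super().__init__(void_element_close_prefix="")

    def attributes(self, tag):
        # Yield attributes in the order the parser stored them (no sorting).
        yield from tag.attrs.items()


def round_trips(html):
    """Hypothetical helper: True if parsing and re-serializing reproduces `html` exactly."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.encode(formatter=UnsortedAttributes()).decode() == html


print(round_trips('<span id="100" class="test"></span>'))
print(round_trips('<input id="NOT_CLOSED_TAG" type="Button">'))
print(round_trips("<span id='100' class='test'></span>"))
```

The first two inputs are the ones shown above and come back unchanged. The third should not, because attribute values are written back in double quotes whatever quoting the input used. Details the parse tree does not store, such as quote style or extra whitespace inside a tag, are therefore not restored by this formatter, so a character-level comparison can still fail on such inputs.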