I have an issue with BeautifulSoup. Whenever I parse an HTML input, it changes the order of the attributes (e.g. class, id) of the HTML tags.
For example:
from bs4 import BeautifulSoup
tags = BeautifulSoup('<span id="100" class="test"></span>', "html.parser")
print(str(tags))
Prints:
<span class="test" id="100"></span>
As you can see, the order of the class and id attributes was changed. How can I prevent this behavior?
I am unfamiliar with web development, but I know that the order of the attributes doesn't matter.
My main goal here is to preserve the original shape of the HTML input after parsing it because I want to loop through the tags and match them (at character-level) with other HTML texts.
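As you stated, the order of attributes in HTML doesn't matter. BeautifulSoup's default formatter sorts attributes alphabetically when it writes a tag out, which is why class ends up before id. If you really want unsorted attributes, you can do: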
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        # Yield attributes in their stored order instead of the default sorted order
        yield from tag.attrs.items()
tags = BeautifulSoup('<span id="100" class="test"></span>', "html.parser")
print(tags.encode(formatter=UnsortedAttributes()).decode())
Prints:
<span id="100" class="test"></span>
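Since the goal is character-level matching, it may help to check that a given input actually survives the parse-and-serialize round trip. The following is only a minimal sketch: `render_unsorted` is a hypothetical helper name (not part of BeautifulSoup) that wraps the formatter above. Keeping attribute order does not guarantee an identical string in general, because other details such as quoting style or spacing inside a tag may still be normalized.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class UnsortedAttributes(HTMLFormatter):
    """Write attributes in the order stored on the tag, not sorted."""

    def attributes(self, tag):
        yield from tag.attrs.items()


def render_unsorted(html):
    # Hypothetical helper: parse the input, then serialize it again
    # without sorting the attributes.
    soup = BeautifulSoup(html, "html.parser")
    return soup.encode(formatter=UnsortedAttributes()).decode()


original = '<span id="100" class="test"></span>'
# Compare the serialized result with the input, character by character.
print(render_unsorted(original) == original)
```

For the sample input, the serialized result is the same string as the input, consistent with the output shown above. For other inputs, a comparison like this is a quick way to spot where the serialized form diverges from the source.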
EDIT: To avoid adding a closing slash to void tags (such as input), you can try:
class UnsortedAttributes(HTMLFormatter):
    def __init__(self):
        super().__init__(
            void_element_close_prefix=""
        )  # <-- use void_element_close_prefix="" here

    def attributes(self, tag):
        yield from tag.attrs.items()

tags = BeautifulSoup(
    """<input id="NOT_CLOSED_TAG" type="Button">""",
    "html.parser",
)
print(tags.encode(formatter=UnsortedAttributes()).decode())
Prints:
<input id="NOT_CLOSED_TAG" type="Button">