I use BeautifulSoup 3.2.1 to parse a lot of HTML files translated with eTranslation. I found that

soup = BeautifulSoup(html_file, "html.parser")

sometimes drops a section of my HTML file, and this seems to be related to invalid tags or other problems in the HTML. I also found that

soup = BeautifulSoup(html_file, "lxml")

handles these cases of badly written HTML better.
Is there a way to detect which HTML file is invalid using BeautifulSoup?
I imagined something like this:
if valid(html_file):
    soup = BeautifulSoup(html_file, "html.parser")
else:
    soup = BeautifulSoup(html_file, "lxml")
Here is what I did. Since BeautifulSoup fixes invalid HTML while parsing, comparing the parsed result back to the original string tells you whether the input was valid:
from bs4 import BeautifulSoup

def is_valid_HTML_tag(html_string_to_check: str) -> bool:
    # If html.parser had to repair anything, re-serializing the soup
    # will not reproduce the original string.
    soup = BeautifulSoup(html_string_to_check, 'html.parser')
    return html_string_to_check == str(soup)

print(is_valid_HTML_tag('<div>valid</div>'))
print(is_valid_HTML_tag('<div>invalid'))
gives
True
False
respectively.
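Putting the two pieces together, here is a minimal sketch of the parser dispatch from the question, built on the same round-trip comparison (the helper names `is_valid_html` and `make_soup` are my own). One caveat to be aware of: BeautifulSoup also normalizes some perfectly *valid* markup, for example it re-serializes attribute values with double quotes, so the check can report false negatives.

```python
from bs4 import BeautifulSoup

def is_valid_html(html: str) -> bool:
    # Round trip: html.parser rewrites anything it has to repair,
    # so output == input only when the markup was already clean.
    return html == str(BeautifulSoup(html, "html.parser"))

def make_soup(html: str) -> BeautifulSoup:
    # Fall back to the more forgiving lxml parser for invalid input.
    parser = "html.parser" if is_valid_html(html) else "lxml"
    return BeautifulSoup(html, parser)

# False negative from normalization, not from invalid HTML:
print(is_valid_html('<div class="x">ok</div>'))  # True
print(is_valid_html("<div class='x'>ok</div>"))  # False: quotes rewritten
```

If such false negatives matter for your files, the practical consequence is small: the input just gets parsed with lxml instead of html.parser.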