python, html, beautifulsoup

Validate HTML with BeautifulSoup


I use BeautifulSoup 3.2.1 to parse a lot of HTML files translated with eTranslation.

I found that soup = BeautifulSoup(html_file, "html.parser") sometimes cuts off a section of my HTML file. This seems to be related to invalid tags or other problems in the HTML.

I also found that soup = BeautifulSoup(html_file, "lxml") works better in these cases of badly written HTML.

Is there a way to detect which HTML files are invalid using BeautifulSoup?

I imagine something like this:

if valid(html_file):
    soup = BeautifulSoup(html_file, "html.parser")
else:
    soup = BeautifulSoup(html_file, "lxml")

Solution

  • Here is what I did. BeautifulSoup repairs invalid HTML while parsing, so comparing the re-serialized result with the original string shows whether the parser had to change anything, that is, whether the input was valid as written.

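    from bs4 import BeautifulSoup


    def is_valid_HTML_tag(html_string_to_check: str) -> bool:
        """Return True if the parser leaves the markup unchanged."""
        # html.parser closes unclosed tags and otherwise repairs the input
        soup = BeautifulSoup(html_string_to_check, 'html.parser')
        # Any difference after re-serializing means something was repaired
        return html_string_to_check == str(soup)


    print(is_valid_HTML_tag('<div>valid</div>'))  # well-formed
    print(is_valid_HTML_tag('<div>invalid'))      # missing closing tag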

    This prints:

    True
    False
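The round-trip check can be combined with the parser fallback sketched in the question. The following is only a minimal sketch: the helper names round_trips and parse_with_fallback are hypothetical, and it assumes lxml is installed for the default fallback. Note that the comparison is strict. For example, html.parser re-serializes a bare <br> as <br/>, so some valid HTML is also reported as changed. For this use case that only means lxml is used more often than strictly necessary.

```python
from bs4 import BeautifulSoup


def round_trips(html: str, parser: str = "html.parser") -> bool:
    """Return True if parsing and re-serializing leaves the markup unchanged."""
    return str(BeautifulSoup(html, parser)) == html


def parse_with_fallback(html: str,
                        strict: str = "html.parser",
                        lenient: str = "lxml"):
    """Parse with `strict`; if the markup was altered, re-parse with `lenient`.

    Returns a (soup, parser_name) tuple so the caller can log which files
    needed the fallback. Both parser names are assumptions taken from the
    question; the lenient parser must be installed separately.
    """
    soup = BeautifulSoup(html, strict)
    if str(soup) == html:
        # Nothing was repaired, so keep the strict parser's result
        return soup, strict
    # The strict parser changed the markup; let the lenient parser handle it
    return BeautifulSoup(html, lenient), lenient
```

A call such as soup, parser = parse_with_fallback(html_text) then returns "html.parser" as the second element for inputs that survive the round trip and "lxml" otherwise, which also gives a list of the files worth inspecting by hand.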