python xml dom expat-parser

Python xml.dom and bad XML


I'm trying to extract some data from various HTML pages using a Python program. Unfortunately, some of these pages contain user-entered data which occasionally has "slight" errors, namely mismatched tags.

Is there a good way to have Python's xml.dom try to correct such errors? Alternatively, is there a better way to extract data from HTML pages which may contain errors?


Solution

  • You could use HTML Tidy to clean up the markup, or Beautiful Soup to parse it as-is. You may have to save the result to a temporary file first, but it should work.

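To illustrate the difference between a strict XML parser and a lenient HTML one, here is a minimal sketch. It is not the Beautiful Soup or HTML Tidy approach itself: to stay dependency-free it swaps in the standard library's html.parser, which likewise does not raise on mismatched tags. The sample markup, the BoldTextExtractor class, and the helper names extract_bold and strict_parse_fails are all hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: the <b> and <i> tags are closed in the wrong order.
BAD_HTML = "<p>Price: <b><i>42</b></i> EUR</p><p>Stock: <b>7</b></p>"


class BoldTextExtractor(HTMLParser):
    """Collect the text found inside <b> elements, tolerating tag mismatches."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <b> tags currently open
        self.values = []   # one string per <b> element seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1
            self.values.append("")

    def handle_endtag(self, tag):
        # Only <b> end tags matter here; stray or misordered ones are ignored.
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.values[-1] += data


def extract_bold(html):
    """Return the stripped text of every <b> element in the markup."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


def strict_parse_fails(html):
    """Return True if xml.dom.minidom (expat) rejects the markup."""
    try:
        minidom.parseString(html)
    except ExpatError:
        return True
    return False


if __name__ == "__main__":
    print(strict_parse_fails(BAD_HTML))  # expat stops at the mismatched tag
    print(extract_bold(BAD_HTML))        # the lenient parser still yields data
```

With the sample above, strict_parse_fails(BAD_HTML) is True while extract_bold(BAD_HTML) returns ['42', '7']. If Beautiful Soup is installed, BeautifulSoup(html, "html.parser").find_all("b") is a roughly comparable lookup with far less code.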