Python xml.dom and bad XML

I'm trying to extract some data from various HTML pages using a Python program. Unfortunately, some of these pages contain user-entered data which occasionally has "slight" errors, namely mismatched tags.

Is there a good way to have Python's xml.dom try to correct such errors? Alternatively, is there a better way to extract data from HTML pages which may contain errors?

Solution

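You could use HTML Tidy to clean up the markup first, or Beautiful Soup to parse it directly, since it is designed to cope with malformed HTML. You may have to save the cleaned-up result to a temporary file before parsing it, but it should work.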
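
If installing a third-party tool is not an option, here is a minimal sketch of the same idea (parse leniently instead of demanding well-formed XML) using only the standard library's html.parser. Note that this is a substitute for the two tools named above, not how they work internally. The class TextByTag, the function extract_text and the sample markup are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collect the text inside each occurrence of one tag.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # number of target tags currently open
        self.results = []   # one string per closed outermost target tag
        self._buf = []      # text pieces seen inside the current target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self._buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        # Stray end tags for other elements (e.g. a lone </i>) are ignored.
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


def extract_text(html, tag):
    """Return the text of every closed <tag> element in html."""
    parser = TextByTag(tag)
    parser.feed(html)
    parser.close()
    return parser.results


# Sample input with an unclosed <b> and a stray </i>.
broken = "<div><p>first <b>bold</p><p>second</i></p></div>"
print(extract_text(broken, "p"))  # ['first bold', 'second']
```

html.parser reports tags as events and does not raise on mismatched ones, so the text can still be collected. It does not repair the tree, though: a target tag that is never closed produces no result in this sketch, which is where Beautiful Soup or HTML Tidy would likely serve you better.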

Cheers,