
pretty-printing malformed xml


I am working on a data migration in which I parse html and export it into xml. The html gets escaped, of course, when it goes into the xml, but to verify that the parsing is happening properly, I am decoding the brackets to get readable html tags inside the xml. However, the tags all run together, so it's still not very readable.

Is there something that can simply indent the tag structure that I have? It's neither valid xml nor html. I've tried xmllint --format and xmllint --htmlout, but both of those choke at different points.

Can I avoid doing this by hand?

Here is a small example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<result><node><title>This would be the title</title><uri>/path/filename.jpg</uri><alt>Alt tag data</alt><body><p>Some text goes here.</body></node></result>

In the actual data, the html tags inside <body> are all escaped to &lt; and &gt;, but that was too difficult to eyeball to see if the parsing worked correctly. So I changed them to their bracket equivalents with a find and replace. But they are still not indented, so it is difficult to read.
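For reference, the find and replace amounts to something like the following minimal Python sketch. The function name `unescape_brackets` is made up for illustration; it deliberately touches only `&lt;` and `&gt;` and leaves every other entity (such as `&amp;`) alone.

```python
def unescape_brackets(text: str) -> str:
    """Turn escaped angle brackets back into literal ones.

    Only &lt; and &gt; are replaced; other entities are left untouched,
    which mirrors a plain editor find and replace.
    """
    return text.replace("&lt;", "<").replace("&gt;", ">")


if __name__ == "__main__":
    print(unescape_brackets("&lt;p&gt;Some text goes here.&lt;/body&gt;"))
```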

Both tidy and xmllint complain about the missing closing </p> tag. In this data, there are a number of missing or mismatched tags. I understand that this is not valid html or xml, but we'll clean up the html later. At this point I just have to make sure that the html is getting parsed at the right places, which is difficult to see when there are no line breaks or indentation.

To fix the above example, I could remove or close the <p> tag manually, but in the actual data, there is a lot of brokenness, and it would be a non-trivial task to fix tags just to get it to parse for formatting. At this phase I am trying to avoid manual massaging and do things in an automated manner.

For example, for this one file, tidy reports 65 warnings and 778 errors. Fixing them all by hand would be a waste of time -- I might as well start indenting myself. I need something that can indent in a non-strict manner, and is not going to care about unmatched tags.


Solution

  • I used the formatting function that user Josh Leitzel posted here. Not perfect, but good enough.
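For anyone who wants the general idea without hunting for that function, here is a minimal Python sketch of a non-strict indenter. It is not the function referenced above, just an illustration of the same approach: split the input into tags and text with a regular expression, put each token on its own line, and track depth with a simple counter instead of a real parser. The name `lenient_indent` and the two-space default are my own choices.

```python
import re

# A tag is anything between "<" and ">"; everything else is text.
# The trailing "|<" keeps a stray "<" instead of silently dropping it.
TOKEN_RE = re.compile(r"<[^<>]+>|[^<]+|<")


def lenient_indent(markup: str, indent: str = "  ") -> str:
    """Indent tag soup one token per line without validating it."""
    depth = 0
    lines = []
    for token in TOKEN_RE.findall(markup):
        token = token.strip()
        if not token:
            continue  # whitespace between tags
        is_tag = token.startswith("<") and token.endswith(">") and len(token) > 2
        if is_tag and token.startswith("</"):
            # Closing tag: step out first, but never go below zero,
            # so surplus closing tags cannot break the output.
            depth = max(depth - 1, 0)
            lines.append(indent * depth + token)
        elif is_tag and not token.startswith(("<?", "<!")) and not token.endswith("/>"):
            # Ordinary opening tag: print it, then step in.
            lines.append(indent * depth + token)
            depth += 1
        else:
            # Text, declarations, comments, self-closing tags: depth unchanged.
            lines.append(indent * depth + token)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<result><node><title>This would be the title</title>"
        "<body><p>Some text goes here.</body></node></result>"
    )
    print(lenient_indent(sample))
```

Because nothing is validated, an unclosed tag such as the `<p>` in the example simply leaves every later line one level deeper than expected. That drift is a handy visual cue for where the markup is broken. The sketch does not handle a `>` inside a comment or attribute value, which was acceptable for eyeballing output like this.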