json xml ampersand

Replacing & in invalid XML


I have a requirement to convert XML to JSON, parse the JSON, and save the data in the database as is (exactly as it came in the incoming XML). The incoming XML documents contain both a bare & and its HTML equivalent &amp;. To process such documents, I tried replacing each & with its HTML equivalent, but that messes things up when I try to revert to the original data before saving it in the database. Any input on how this can be done will be appreciated.


Solution

  • First try to establish whether the bug can be fixed at source: find out how the (non-)XML was generated, fix the program that created it, and then regenerate the data correctly.

    If you have no alternative other than repairing the corrupt data, first investigate it so that you understand exactly what corruptions you are dealing with. In particular, establish all the patterns of data that use an ampersand both correctly and incorrectly.

    Then use a text-based tool (not an XML-based tool) such as sed or perl to match the patterns you have discovered and correct them.

    But treat this as a one-off and don't let it become normal. You wouldn't accept faulty goods from your suppliers, so why should you accept faulty XML?
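
As a rough illustration of the text-based repair described above, here is a minimal Python sketch (the answer suggests sed or perl; the same regular expression idea carries over). It assumes, and this is only an assumption about your data, that the only legitimate references are the five predefined XML entities and numeric character references, so every other & is treated as a bare ampersand and escaped. The names `VALID_REFERENCE`, `BARE_AMPERSAND`, and `escape_bare_ampersands` are made up for this example. Check the pattern against the corruptions you actually find before relying on it.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the only legitimate references in the data are the five
# predefined XML entities and numeric character references.
VALID_REFERENCE = r"(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);"

# An "&" that is NOT followed by one of the valid references above.
BARE_AMPERSAND = re.compile(r"&(?!" + VALID_REFERENCE + r")")


def escape_bare_ampersands(text: str) -> str:
    """Replace each '&' that does not start a valid reference with '&amp;'."""
    return BARE_AMPERSAND.sub("&amp;", text)


# Hypothetical sample mixing bare and correctly escaped ampersands.
broken = "<item>Tom & Jerry &amp; friends, AT&T, &#38;</item>"
repaired = escape_bare_ampersands(broken)

# The repaired text should now be accepted by a real XML parser.
root = ET.fromstring(repaired)
print(repaired)
print(root.text)
```

Because the pattern skips ampersands that are already escaped, running the repair a second time leaves the text unchanged. After parsing, both the originally bare and the originally escaped ampersands come back as a plain & in the element text.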