Tcl tDOM parsing fails due to special characters in XML tags


I am trying to remove some special characters that appear in the text inside XML tags. We could use regsub or string map to eliminate the XML special characters in the tagged text, but that is a lengthy, time-consuming process because our log file is very large, around 25 MB.

Is there any special method or tip for eliminating special characters in XML tags?

Here is what a sample looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Customers>
    <Customer>
        <CustomerID>BLAUS</CustomerID>
        <CompanyName>Blauer See Delikatessen</CompanyName>
        <ContactName>Hanna Moos</ContactName>
        <Region>test<ing</Region>
    </Customer>
    <Customer>
        <CustomerID>SPLIR</CustomerID>
        <CompanyName>Split Rail Beer & Ale</CompanyName>
        <ContactName>Art raunschweiger</ContactName>
        <Region>WY</Region>
    </Customer>
</Customers>

Thanks Malli


Solution

  • If you mean the ampersand, it is not in a tag; it is in the text that appears between two tags.

    The reason people choose to use XML for data interchange is that it's a standard, and there's lots of software around to handle it. That advantage disappears entirely if you try to use something that's almost XML but not quite.

    By far the best solution is to fix the program that is generating this not-quite-XML.

    If you really can't do that, you'll have to try to repair it, and the way to do that depends on the nature of the damage. You could, for example, use any language that supports regular expressions to replace with "&amp;" every ampersand that is not followed by either '#' or a sequence of alphanumerics and then a semicolon (that is, every ampersand that does not already start an entity or character reference). However, if the data contains this error, it has been generated carelessly, and so it could contain any number of other errors as well.
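
The ampersand repair described above can be sketched in Tcl, since that is the language the question uses. This is only a minimal sketch under stated assumptions: the proc name `escapeBareAmpersands` is hypothetical, the pattern is a slightly stricter reading of "'#' or alphanumerics, then a semicolon", and it relies on the negative lookahead `(?!...)` supported by Tcl's advanced regular expressions. It does nothing about other damage, such as the stray `<` in `test<ing`.

```tcl
# Hypothetical helper (not from the original post): escape every ampersand
# that does not already start an entity or character reference.
proc escapeBareAmpersands {xml} {
    # Negative lookahead: match "&" only when it is NOT followed by
    # a decimal reference (#38;), a hex reference (#x26;) or a name (amp;).
    set bare {&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)}
    # In the substitution spec a plain "&" stands for the matched text,
    # so the literal ampersand is written with a backslash in front.
    return [regsub -all $bare $xml {\&amp;}]
}

# One bare ampersand, plus references that must be left untouched.
set sample {<CompanyName>Split Rail Beer & Ale</CompanyName> <a>x &amp; y &#38; z</a>}
puts [escapeBareAmpersands $sample]
```

Because this is a single `regsub -all` pass over one string, a file of about 25 MB can usually be read into memory with `read`, repaired in one call, and then handed to tDOM's `dom parse`. Text such as `AT&T;` would still be mistaken for an entity reference, so the result should be treated as a best-effort repair rather than guaranteed well-formed XML.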