java xml parsing tableofcontents

Load an XML file with unescaped quotes


I have an XML file containing a table of contents. The problem is that it contains unescaped quotes inside attribute values. How can I load the file and repair these quotes?

<?xml version="1.0" encoding="UTF-8"?>
<?NLS TYPE="org.eclipse.help.toc"?>

<topic label="Main Topic" href="0.2.1.html#0.2.5">
    <topic label="Topic "Sales"" href="0.2.1.html#2.12.3.6"/>
</topic>

I know that the standard says:

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup

The source doesn't escape the quotes, and I cannot change the source. How can I repair the XML file locally?


Solution

  • Don't call it XML when it isn't.

    If you want to process this file you'll need to discover what rules (grammar) it conforms to, and write a parser for that grammar. This may be rather difficult; I suspect the grammar, when you discover it, will be ambiguous and will require infinite lookahead to resolve.
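If the damage is limited to stray quotes inside attribute values, as in the sample from the question, one possible heuristic is a text-level pre-pass before handing the result to a real XML parser. The sketch below is in Java (per the question's tag) and rests on one assumption: a value's closing quote is the first `"` that is followed either by another `name="` attribute or by the end of the tag (`>`, `/>`, `?>`). Every other quote inside the value is rewritten to `&quot;`. The class name `QuoteRepair` and the method `repair` are made up for illustration. This does not remove the ambiguity described above: a value that itself contains text such as `" href="` would be split in the wrong place, values spanning several lines are not handled, and a `="..."` sequence in element text is treated like an attribute.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class QuoteRepair {
    // Group 1: the opening =" of an attribute value.
    // Group 2: the value, matched reluctantly up to the first quote that is
    //          followed by another attribute (whitespace, name, =, quote)
    //          or by the end of the tag (>, /> or ?>).
    // This lookahead is the heuristic assumption, not a rule from the XML spec.
    private static final Pattern ATTRIBUTE_VALUE = Pattern.compile(
            "(=\\s*\")(.*?)\"(?=\\s+[\\w:.-]+\\s*=\\s*\"|\\s*/?>|\\s*\\?>)");

    /** Escapes quotes found inside attribute values; everything else is copied as-is. */
    static String repair(String brokenXml) {
        Matcher m = ATTRIBUTE_VALUE.matcher(brokenXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2).replace("\"", "&quot;");
            // quoteReplacement keeps '$' and '\' in the value from being
            // interpreted as replacement syntax.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String broken =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              + "<?NLS TYPE=\"org.eclipse.help.toc\"?>\n"
              + "<topic label=\"Main Topic\" href=\"0.2.1.html#0.2.5\">\n"
              + "    <topic label=\"Topic \"Sales\"\" href=\"0.2.1.html#2.12.3.6\"/>\n"
              + "</topic>\n";

        String fixed = repair(broken);
        System.out.println(fixed);

        // Hand the repaired text to a standard parser to confirm it is well-formed.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(fixed)));
        System.out.println(doc.getDocumentElement().getAttribute("label"));
    }
}
```

To use this on a real file, read it into a string (for example with `java.nio.file.Files.readAllBytes` and the file's encoding), call `repair`, and parse the result. Since the pre-pass is only a guess at the intended structure, it is worth comparing the repaired output against the original by eye for at least a few entries.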