I'm using XML::Simple to parse an XML file. The problematic part looks like this:
<textBody>
<title>
<titlePart>
<text>SECTION A <emdash/> HUMAN NECESSITIES</text>
</titlePart>
</title>
</textBody>
<ipcEntry kind="t" symbol="A01" ipcLevel="C" entryType="K" lang="EN">
<textBody>
<title>
<titlePart>
<text>AGRICULTURE</text>
</titlePart>
</title>
</textBody>
</ipcEntry>
For some reason, XML::Simple completely ignores <text>SECTION A <emdash/> HUMAN NECESSITIES</text>.
I guess it's because of the emdash tag, since <text>AGRICULTURE</text>
is parsed just fine.
I also tried setting the parser with:
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';
but still no luck. Any ideas?
Having a tag whose value includes both text and other tags is called "mixed content". XML::Simple doesn't handle mixed content (not usefully, anyway). In XML::Simple's view of the universe, a tag can contain either text or other tags, not both. That's why it's called "Simple". To quote its docs:
Mixed content (elements which contain both text content and nested elements) will not be represented in a useful way - element order and significant whitespace will be lost. If you need to work with mixed content, then XML::Simple is not the right tool for your job.
You'll have to pick a different XML module. XML::LibXML and XML::Twig are popular choices.
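As a minimal sketch of the XML::LibXML route, the snippet below walks the child nodes of the mixed-content <text> element in document order and substitutes an em dash character for each <emdash/> tag. The helper name text_with_dashes and the XPath //titlePart/text are illustrative assumptions based on the excerpt in the question, not something the real file is guaranteed to match.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample input copied from the question's excerpt.
my $xml = <<'XML';
<textBody>
<title>
<titlePart>
<text>SECTION A <emdash/> HUMAN NECESSITIES</text>
</titlePart>
</title>
</textBody>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Hypothetical helper: flatten a mixed-content element into a string,
# replacing each <emdash/> child with U+2014 and keeping document order.
sub text_with_dashes {
    my ($node) = @_;
    my $out = '';
    for my $child ($node->childNodes) {
        if ($child->nodeName eq 'emdash') {
            $out .= "\x{2014}";
        }
        else {
            # Text nodes (and any other children) contribute their text.
            $out .= $child->textContent;
        }
    }
    return $out;
}

my ($text_node) = $doc->findnodes('//titlePart/text');
my $title = text_with_dashes($text_node);

binmode STDOUT, ':encoding(UTF-8)';
print "$title\n";
```

Because the DOM keeps text nodes and element nodes as ordered siblings, the position of the dash relative to the surrounding text survives, which is exactly what XML::Simple's hash-based structure loses.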
Another possibility would be to get whoever produced the XML to use character references (entities) instead of tags to represent characters like a dash. For example, XML::Simple could handle:
<text>SECTION A &#8212; HUMAN NECESSITIES</text>
just fine. (&#8212; is the numeric character reference for an em dash.)
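To illustrate the second option, here is a small sketch assuming the producer emits the numeric character reference &#8212; in place of the <emdash/> tag. The wrapping <titlePart> element is taken from the question's excerpt. With no nested tags inside <text>, the element holds plain text, so XML::Simple should return it as an ordinary hash value.

```perl
use strict;
use warnings;
use XML::Simple;

# Assumed input: same structure as the question, but with a character
# reference where the <emdash/> tag used to be.
my $xml = '<titlePart><text>SECTION A &#8212; HUMAN NECESSITIES</text></titlePart>';

# XMLin treats a string containing markup as XML rather than a filename.
# The root element (<titlePart>) is discarded by default.
my $data = XMLin($xml);

binmode STDOUT, ':encoding(UTF-8)';
print $data->{text}, "\n";
```

The parser resolves the reference before XML::Simple sees the content, so the value of $data->{text} contains the em dash as a regular character.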