java, xml, xml-parsing, cdata

XML parser: quick way to parse everything inside an element as text?


Is there a quick way to have an XML parser return everything inside an element as text? For example:

<foo>
    text
    <bar>
        <![CDATA[ ... ]]>
        <hello><world></world>
        </hello> 
    </bar>
    text
</foo>

How can I parse everything inside <foo> as text, as if it were wrapped in <![CDATA[ ]]>?

The reason is that the content contains nested <![CDATA[ ]]> sections. Nesting CDATA makes it harder for a human to read and write.

SAXParser/StreamParser use callbacks or events, so elements, attributes, and text need to be concatenated back together. Is there a more efficient way?

Is there a way to tell an XML parser to treat everything inside an element as text? Can XML Schema do it?


Solution

  • You ask ...

    Is there a way to tell an XML parser to treat everything inside an element as text? Can XML Schema do it?

    ... by which you seem to mean

    How can I parse everything inside <foo> as text, as if it were wrapped in <![CDATA[ ]]>?

    That would mean that you get a result String with this literal content:*

        text
        <bar>
            <![CDATA[ ... ]]>
            <hello><world></world>
            </hello> 
        </bar>
        text
    

    That is, including all whitespace and a text representation of all tags, comments, processing instructions, and CDATA sections. I'm having trouble aligning that with ...

    The reason is that the content contains nested <![CDATA[ ]]> sections. Nesting CDATA makes it harder for a human to read and write.

    That is, I don't disagree about CDATA sections being hard-ish to read, but I don't see how that suggests creating strings that contain textual versions of CDATA sections.

    I'm also having trouble squaring that with ...

    SAXParser/StreamParser use callbacks or events, so elements, attributes, and text need to be concatenated back together. Is there a more efficient way?

    ... because a SAXParser won't allow you to do what you asked. Although you could use SAX callbacks to reconstruct a text representation of the XML that has been parsed, that would not

    • reproduce CDATA sections as text (the core callbacks deliver only their contents; the optional LexicalHandler extension reports where a section starts and ends, but not its original text),
    • ensure that tags are reproduced verbatim (internal whitespace, type of attribute quotation marks, and representation style for empty tags are not conveyed by SAX),
    • avoid expanding at least some of the entity references in text and attribute values (but outside CDATA sections), or
    • preserve comments (unless a LexicalHandler is registered in addition to the usual content handler).

    There might be other issues. And if your goal happens to be to perform some kind of XML cleanup to improve human readability then note well that there is a fairly high risk that the string emitted by such a process will not constitute well-formed XML element content. After all, one of the purposes of CDATA sections and many entity references is exactly to express content that would render the document ill-formed if it were expressed literally. That escaping function would be lost, so you would need to re-escape the content if you wanted to be able to re-parse it as XML.
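    To make the losses described above concrete, here is a minimal Java sketch that rebuilds the content of the root element from the core SAX callbacks only (a plain DefaultHandler, no LexicalHandler). The class name SaxRebuild, the method rebuild, and the sample document are all hypothetical, chosen just for illustration. It is a demonstration of what gets lost, not a recommended approach.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxRebuild {
    /** Rebuilds an approximate text form of everything inside the root element. */
    static String rebuild(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // 0 = outside the root element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Skip the root's own start tag; emit nested start tags.
                if (depth++ > 0) {
                    out.append('<').append(qName);
                    for (int i = 0; i < atts.getLength(); i++) {
                        // The original quote style is unknown, so double quotes are assumed.
                        out.append(' ').append(atts.getQName(i))
                           .append("=\"").append(atts.getValue(i)).append('"');
                    }
                    out.append('>');
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Skip the root's own end tag; emit nested end tags.
                if (--depth > 0) {
                    out.append("</").append(qName).append('>');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text arrives already unescaped, and CDATA contents look like ordinary text.
                if (depth > 0) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<foo>a<bar x='1'><![CDATA[1 < 2]]><!-- note --><empty/></bar>&amp;b</foo>";
        // prints: a<bar x="1">1 < 2<empty></empty></bar>&b
        System.out.println(rebuild(xml));
    }
}
```

    In the printed result the CDATA markers and the comment are gone, the single quotes have become double quotes, the empty-element tag has been expanded, and the bare `<` and `&` mean the string is no longer well-formed element content.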


    If you really do want to extract the raw text of an XML element then no, no XML API I know of provides for it, and XML Schema cannot do it. XML technologies work with document structure, but you seem to want to override document structure. You'll probably need to write your own XML parser if that's in fact what you want to do.
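    If raw text really is the goal, one crude alternative to a full hand-written parser is to bypass XML processing and slice the source string directly. The following Java sketch is only that: the class RawElementText and the method rawInner are hypothetical names, and it assumes the element occurs exactly once in the document, that no other element name begins with the same characters, that its start tag contains no `>` inside attribute values, and that its end tag never appears literally inside a comment or CDATA section after the real one.

```java
class RawElementText {
    /**
     * Returns the raw characters between the first start tag of the named element
     * and the last end tag of that name, or null if either tag is missing.
     * This is plain string slicing, not XML parsing.
     */
    static String rawInner(String doc, String name) {
        int open = doc.indexOf("<" + name);         // start of the start tag
        if (open < 0) {
            return null;
        }
        int openEnd = doc.indexOf('>', open);       // end of the start tag
        int close = doc.lastIndexOf("</" + name);   // start of the last end tag
        if (openEnd < 0 || close < openEnd) {
            return null;
        }
        return doc.substring(openEnd + 1, close);
    }

    public static void main(String[] args) {
        String doc = "<foo>\n    text\n    <bar>\n        <![CDATA[ ... ]]>\n    </bar>\n    text\n</foo>";
        // Prints the content of <foo> verbatim, CDATA markers and whitespace included.
        System.out.println(rawInner(doc, "foo"));
    }
}
```

    Because nothing is parsed, the result keeps CDATA markers, comments, and whitespace exactly as written, but it also offers none of the guarantees of a real parser: the assumptions listed above are the caller's responsibility.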


    * Or at least, that's my best guess at what you mean. CDATA sections do not nest, so if you literally wrapped the CDATA-including raw element text inside a CDATA section then the XML interpretation of the result would be different, and not likely what you're really after.