Tags: ios, xml, nsxmlparser

Can I leave some sections unparsed using NSXMLParser?


I have an XML document which I want to parse using NSXMLParser. One of the tags it can contain is <html>, and in my parsed representation I want the contents of that tag, verbatim. However, when I parse the document, my delegate methods are called for the start, end and contents of each tag inside the html tag.

I can't get the provider of the document to add CDATA tags; nor can I use something other than NSXMLParser to parse the document.

Is there a way for me to tell the parser to treat the contents of HTML tags as CDATA and to leave them unparsed, even if they contain other tags?


Solution

  • It's unfortunate that the owner of the XML feed won't fix it because, depending on the HTML, you may end up with a malformed XML document. If it really is meant to be XML, they should either wrap the HTML in a CDATA section or escape it, replacing every & with &amp;, every < with &lt; and every > with &gt;.

    Frankly, if all you need is the HTML, and all you have is an XML tag that contains the HTML without a CDATA section or character escaping, I would not be inclined to run it through NSXMLParser at all, because successful parsing depends on whether the embedded HTML happens to be well-formed XML. I'd use an NSScanner or NSRegularExpression to extract all of the text between the opening and closing tags that wrap your HTML.
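A minimal Swift sketch of that extraction, assuming the document contains a single <html> element and has already been decoded into a string. The function name `rawContents(ofTag:in:)` is hypothetical, and the sketch uses Foundation's regular-expression search on String rather than NSScanner or NSRegularExpression directly:

```swift
import Foundation

/// Returns the raw text between the first `<tag ...>` and the last `</tag>`,
/// or nil if either one is missing. Hypothetical helper for illustration only.
func rawContents(ofTag tag: String, in xml: String) -> String? {
    guard
        // Opening tag, with optional attributes, found by regular expression.
        let open = xml.range(of: "<\(tag)(\\s[^>]*)?>", options: .regularExpression),
        // Closing tag, searched from the end so that nested markup is kept.
        let close = xml.range(of: "</\(tag)>", options: .backwards),
        open.upperBound <= close.lowerBound
    else { return nil }
    return String(xml[open.upperBound..<close.lowerBound])
}

let sample = "<item><title>Hi</title><html><p>Hello <b>world</b></p></html></item>"
let extracted = rawContents(ofTag: "html", in: sample)
```

Because nothing is parsed, this works even when the embedded HTML is not well-formed XML. It would need adjusting if the feed can contain more than one <html> element.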

    Or, if you really want to use NSXMLParser (because there's other stuff in addition to the HTML that you need), then manually alter the NSData before parsing, wrapping the HTML in a CDATA section yourself.
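One way this could look in Swift (where NSXMLParser is called XMLParser), as a sketch rather than a definitive implementation. It assumes the data is UTF-8 and contains a single <html> element; `wrappingInCDATA`, `HTMLCollector` and `extractHTML` are hypothetical names:

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Wraps everything between `<tag ...>` and the last `</tag>` in a CDATA section.
func wrappingInCDATA(tag: String, in xml: String) -> String? {
    guard
        let open = xml.range(of: "<\(tag)(\\s[^>]*)?>", options: .regularExpression),
        let close = xml.range(of: "</\(tag)>", options: .backwards),
        open.upperBound <= close.lowerBound
    else { return nil }
    // A literal "]]>" in the HTML would end the CDATA section early, so split it in two.
    let inner = xml[open.upperBound..<close.lowerBound]
        .replacingOccurrences(of: "]]>", with: "]]]]><![CDATA[>")
    let head = String(xml[..<open.upperBound])
    let tail = String(xml[close.lowerBound...])
    return head + "<![CDATA[" + inner + "]]>" + tail
}

/// Collects the CDATA text reported while the parser is inside <html>.
final class HTMLCollector: NSObject, XMLParserDelegate {
    private(set) var html = ""
    private var insideHTML = false

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        if elementName == "html" { insideHTML = true }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "html" { insideHTML = false }
    }

    func parser(_ parser: XMLParser, foundCDATA CDATABlock: Data) {
        if insideHTML, let text = String(data: CDATABlock, encoding: .utf8) {
            html += text
        }
    }
}

/// Patches the raw data, parses it, and returns the verbatim HTML (nil on failure).
func extractHTML(from data: Data) -> String? {
    guard let xml = String(data: data, encoding: .utf8),
          let patched = wrappingInCDATA(tag: "html", in: xml) else { return nil }
    let parser = XMLParser(data: Data(patched.utf8))
    let collector = HTMLCollector()
    parser.delegate = collector
    return parser.parse() ? collector.html : nil
}

// The unclosed <br> would normally make this document fail to parse as XML.
let feed = Data("<item><title>Hi</title><html><p>Hello<br>world</p></html></item>".utf8)
let verbatim = extractHTML(from: feed)
```

The rest of the document still goes through the usual delegate callbacks, so other elements (the <title> here) can be handled as before.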

    If, on the other hand, the document you're trying to parse really isn't XML, but rather is just HTML, then you shouldn't be parsing it with an XML parser at all. You should be using an HTML parser, like Hpple, as described in Galloway's article, How to Parse HTML on iOS, on the Ray Wenderlich site.