iphone cocoa nsxmlparser byte-order-mark nsxmlparsererrordomain

NSXMLParser and BOM bytes


I'm getting an XML file as the result of a PHP query against a server. When I print the resulting data to the console, I see a well-structured XML file. When I try to parse it with NSXMLParser, it fails with an NSXMLParserErrorDomain error, code 4 (empty document). I noticed that the XML files it can't parse have a BOM (byte order mark) sequence right after the closing '>' of the XML header. The question is how to get rid of the BOM sequence. I tried to create a string with those BOM bytes like that:

    const   UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
    NSString    *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
    NSString    *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];

but it doesn't work for some reason. Some XML files have this sequence after the root element instead; in that case NSXMLParser parses the XML successfully. Safari ignores those characters, and so does the Xcode debugger. Please help!

Thanks,

Nava


Solution

  • I tried to create a string with those BOM bytes like that:

    const   UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
    NSString    *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
    NSString    *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];
    

    but it doesn't work for some reason.

    Make sure you gave the correct encoding when instantiating theResult, the string that noBOMString is derived from. If the document data was UTF-8, make sure you instantiated the string as UTF-8. Likewise, if the data was UTF-16, make sure you instantiated the string as UTF-16.

    If you pass the wrong encoding, either the string won't instantiate at all (I'm assuming that isn't your problem) or some characters will be wrong. The BOM would be one of these: if the input is UTF-8 and you interpret it as MacRoman or ISO Latin 1, the BOM will appear in the string as three separate characters. Those three characters won't compare equal to the single character (U+FEFF) that is the BOM.
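
One way to avoid the encoding question altogether is to drop the BOM bytes from the raw data before any string is created. The following is only a rough sketch in plain C (which also compiles inside an Objective-C file), and it assumes the document really is UTF-8. The names `strip_utf8_bom` and `decode_utf8_three_byte` are hypothetical helpers made up for this illustration, not part of any Apple API. The second function just shows why the mismatch happens: decoded as UTF-8, the three bytes collapse into the single code point U+FEFF, whereas a single-byte encoding such as ISO Latin 1 would turn each byte into its own character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF, the byte order mark. */
static const unsigned char kUTF8BOM[3] = {0xEF, 0xBB, 0xBF};

/*
 * Hypothetical helper: removes every occurrence of the UTF-8 BOM byte
 * sequence from buf in place and returns the new length.
 * Assumes buf holds valid UTF-8, where EF BB BF can only encode U+FEFF.
 */
size_t strip_utf8_bom(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t read = 0;
    size_t write = 0;
    while (read < len) {
        if (len - read >= 3 && memcmp(buf + read, kUTF8BOM, 3) == 0) {
            read += 3;               /* skip the three BOM bytes */
            continue;
        }
        buf[write++] = buf[read++];  /* keep everything else */
    }
    return write;
}

/*
 * Hypothetical helper: decodes one three-byte UTF-8 sequence into its
 * code point. The caller must pass a well-formed three-byte sequence.
 */
unsigned long decode_utf8_three_byte(const unsigned char b[3])
{
    return ((unsigned long)(b[0] & 0x0F) << 12)
         | ((unsigned long)(b[1] & 0x3F) << 6)
         |  (unsigned long)(b[2] & 0x3F);
}
```

If you go this route, the buffer could come from a mutable copy of the downloaded data (for example `mutableBytes` and `length` on an `NSMutableData`, followed by `setLength:` with the returned value), and the cleaned data can then be handed to the parser. Note that this removes the bytes rather than replacing them with a space as in the question.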