
Expat (C) - "invalid token" for (nearly) every line


I have some XML I am trying to process with Expat in C. The XML can be parsed in Java so I have no reason to believe it is malformed. Further, the C code I have will parse a string literal I plug in "by hand" - but it fails to parse my XML file.

This is the code (with stuff I've added - if God wanted us to use debuggers he wouldn't have given us printf):

static void XMLCALL
starthandler(void *data, const XML_Char *name, const XML_Char **attr)
{
int i;
if (strcmp(name, "file") == 0) {
    for (i = 0; attr[i]; i += 2) {
        if (strcmp(attr[i], "path") == 0) {
            printf("File is at %s\n", attr[i + 1]);
        }
    }
}
}       

int main(int argc, char *argv[])
{
FILE* inXML;
ssize_t read;
char* line;
size_t len = 0;

XML_Parser p_ctrl = XML_ParserCreate("UTF-8");
if (!p_ctrl) {
    fprintf(stderr, "Could not create parser\n");
    exit(-1);
}

XML_SetStartElementHandler(p_ctrl, starthandler);
inXML = fopen(argv[1], "r");
if (inXML == NULL) {
    fprintf(stderr, "Could not open %s\n", argv[1]);
    XML_ParserFree(p_ctrl);
    exit(-1);
}

while ((read = getline(&line, &len, inXML)) != -1) {
    printf("Line is %s", line);
    enum XML_Status status = XML_Parse(p_ctrl, line, len, 0);
    if (status == 0) {
        enum XML_Error errcde = XML_GetErrorCode(p_ctrl);
        printf("ERROR: %s\n", XML_ErrorString(errcde));
        printf("Error at column number %lu\n",    XML_GetCurrentColumnNumber(p_ctrl));
        printf("Error at line number %lu\n", XML_GetCurrentLineNumber(p_ctrl));
    }
    free(line);
    line = NULL;
    len = 0;
}

XML_ParserFree(p_ctrl);
fclose(inXML);
return 0;
} 

This is the XML file I am attempting to parse:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE threadrecordml [
<!ELEMENT threadrecordml (file)*>
<!ATTLIST threadrecordml version CDATA #FIXED "0.1">
<!ATTLIST threadrecordml xmlns CDATA #FIXED "http://cartesianproduct.wordpress.com">
<!ELEMENT file EMPTY>
<!ATTLIST file thread CDATA #REQUIRED>
<!ATTLIST file path CDATA #REQUIRED>
]>
<threadrecordml xmlns="http://cartesianproduct.wordpress.com">
<file thread="1" path="tester_1.xml" />
<file thread="3" path="tester_3.xml" />
<file thread="2" path="tester_2.xml" />
<file thread="4" path="tester_4.xml" />
<file thread="5" path="tester_5.xml" />
<file thread="6" path="tester_6.xml" />
<file thread="7" path="tester_7.xml" />
<file thread="8" path="tester_8.xml" />
<file thread="9" path="tester_9.xml" />
<file thread="10" path="tester_10.xml" />
<file thread="11" path="tester_11.xml" />
<file thread="12" path="tester_12.xml" />
<file thread="13" path="tester_13.xml" />
<file thread="14" path="tester_14.xml" />
<file thread="15" path="tester_15.xml" />
<file thread="16" path="tester_16.xml" />
<file thread="17" path="tester_17.xml" />
<file thread="18" path="tester_18.xml" />
</threadrecordml>

This is a sample of the output...

adrianm@imola:/n/staffstore/adrianm/optGenC$ ./optgenc ../tester_control.xml 
Line is <?xml version="1.0" encoding="UTF-8" standalone="no"?>
ERROR: not well-formed (invalid token)
Error at column number 0
Error at line number 2
Line is <!DOCTYPE threadrecordml [
ERROR: not well-formed (invalid token)
Error at column number 0
Error at line number 3
Line is <!ELEMENT threadrecordml (file)*>
ERROR: not well-formed (invalid token)
Error at column number 0
Error at line number 4
Line is <!ATTLIST threadrecordml version CDATA #FIXED "0.1">
ERROR: not well-formed (invalid token)
Error at column number 0

(For all lines)

If I "cheat" and add this line after the read...

line = "<file thread=\"1\" path=\"tester.xml\" />";

The line will be parsed (the code of course then breaks for other reasons).

So there would appear to be some mangling going on in the read from the disk file... is this being read as 16 bit perhaps? But changing the encoding of the parser to either NULL or UTF-16 seems to make no difference.

Can anyone offer an explanation? (If it makes any difference I have run this code on both 64 bit OSX and Linux boxes and had the same problem)


Solution

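  • The answer is that getline(...) null-terminates the line it reads, and that the len it fills in is the size of the allocated buffer, not the number of characters read (that count is getline's return value). Passing len to XML_Parse therefore hands the parser the null character that follows the newline, plus whatever else sits in the buffer. A null character is not valid XML, so the parse fails, and because it comes after the newline the error is reported at column 0 of line 2, and so on.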

    Doing this fixes the problem:

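    char data[BUFSIZ];   /* fixed-size read buffer */
    size_t len;
    int done;

    do {
        /* fread returns the number of bytes actually read */
        len = fread(data, 1, sizeof(data), inXML);
        done = len < sizeof(data);   /* a short read means end of file (or a read error) */

        /* Pass only the bytes read, and flag the final chunk via isFinal */
        if (XML_Parse(p_ctrl, data, (int)len, done) == XML_STATUS_ERROR) {
            enum XML_Error errcde = XML_GetErrorCode(p_ctrl);
            printf("ERROR: %s\n", XML_ErrorString(errcde));
            printf("Error at column number %lu\n", XML_GetCurrentColumnNumber(p_ctrl));
            printf("Error at line number %lu\n", XML_GetCurrentLineNumber(p_ctrl));
            break;   /* the parser cannot continue after an error */
        }
    } while (!done);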