
Writing an XML linter in PHP, but neither XMLReader nor XML Parser can recover from a parsing error


I'm tasked with writing an XML linter in PHP 8 that shall serve as a web API. This XML linter must work in verbose mode: it goes through the whole document and logs every error found (up to 1000 errors) with its line number (yes, I know an XML document can be one single line, but it's a mandatory requirement).

In other words, I need an XML reader/parser module that can:

  1. [mandatory] process medium to large XML documents (100MB~1GB).
  2. [mandatory] get past an error and keep parsing, if possible.
  3. [mandatory] let me write my own checker code to validate the value of TEXT nodes.
  4. [mandatory] get the line number of the current node.

But after some study, none of the PHP built-in XML extensions can satisfy these requirements.
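To illustrate where the gap is, here is a minimal sketch (not a full linter) that uses the XML Parser functions to cover requirements 1, 3 and 4: it feeds the input in chunks, passes text to a custom checker, and asks the parser for the current line. The function name `lintStream`, the checker callback and the 8192-byte chunk size are made up for this example. It still stops at the first well-formedness error, which is exactly requirement 2.

```php
<?php
// Hypothetical helper: stream XML through the XML Parser extension in chunks.
// $checkText is an assumed callback returning an error message or null.
function lintStream($stream, callable $checkText): array
{
    $errors = [];
    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    // Called for character data; one text node may arrive in several pieces.
    xml_set_character_data_handler(
        $parser,
        function ($p, string $text) use (&$errors, $checkText) {
            $problem = $checkText($text);
            if ($problem !== null) {
                $errors[] = [
                    'line'    => xml_get_current_line_number($p),
                    'message' => $problem,
                ];
            }
        }
    );

    while (!feof($stream)) {
        $chunk = (string) fread($stream, 8192);
        if (xml_parse($parser, $chunk, feof($stream)) !== 1) {
            // The parser gives up here; nothing after this point is checked.
            $errors[] = [
                'line'    => xml_get_current_line_number($parser),
                'message' => xml_error_string(xml_get_error_code($parser)) ?? 'unknown error',
            ];
            break;
        }
    }

    return $errors;
}
```

Memory use stays bounded by the chunk size, but the `break` after the first failed `xml_parse()` call is the limitation described in this question.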

For example, here is a "bad" XML document in which the closing tags at line 5 (<AuthorityCode>...</Authority>) and line 11 (<LastUpdateTime>...</LastUpdate>) don't match their opening tags:

<?xml version="1.0"?>
<FacilityList>
    <UpdateTime>2022-09-09T08:00:00+08:00</UpdateTime>
    <UpdateInterval type="SEMIAUTO">-1</UpdateInterval>
    <AuthorityCode>CA</Authority>
    <Facility>
        <FacilityID>NFB-NR-P00501-013037-SN-S9K6VPJ36-0002</FacilityID>
        <FacilityClass>01</FacilityClass>
        <FacilityType>003</FacilityType>
        <LocationType>1</LocationType>
        <LastUpdateTime>2022-10-04T13:00:00+08:00</LastUpdate>
    </Facility>
</FacilityList>

The xmllint tool from libxml shows the errors at both line 5 and line 11, but both XMLReader and XML Parser just stop at line 5 and won't go any further, and I can't find a way to bypass that. Yes, I've already set the XML_PARSE_RECOVER flag in XMLReader:

libxml_use_internal_errors(true);
$parser = new XMLReader();
// The trailing 1 is the value of libxml2's XML_PARSE_RECOVER option.
$parser->open($filename, null, LIBXML_NOERROR | LIBXML_NOWARNING | 1);

And it doesn't work (PHP 8.2.6).
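For anyone who wants to reproduce this, a minimal sketch of how the errors can be collected: it reads until XMLReader gives up and then returns whatever libxml has buffered. The function name `collectReaderErrors` is made up for this example, and it takes the XML as a string rather than a file name to keep the sketch short.

```php
<?php
// Hypothetical helper: read as far as XMLReader allows, then return the
// LibXMLError objects that libxml buffered along the way.
function collectReaderErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML($xml);

    // read() returns false at end of input and also at the first fatal error;
    // @ hides the PHP warning that read() raises in the error case.
    while (@$reader->read()) {
        // A real linter would inspect $reader->nodeType / $reader->value here.
    }
    $reader->close();

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```

Each returned `LibXMLError` has `line`, `column` and `message` properties, so the errors that do get reported come with a line number even though the reader itself does not continue.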

Did I do something wrong, or is it just not possible to do what I want using the built-in XMLReader / XML expat parser? DOMDocument can process the file and report both errors, but I don't want to load the whole 1GB of data into memory.
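For comparison, the DOMDocument route looks roughly like the sketch below. It uses the `recover` property together with libxml's internal error buffer. The function name `collectDomErrors` is made up for this example, and the whole document ends up in memory, which is the very thing I want to avoid.

```php
<?php
// Hypothetical helper: parse with DOMDocument in recovery mode and return
// the buffered libxml errors as "line N: message" strings.
function collectDomErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->recover = true;   // ask libxml to keep going after errors
    $dom->loadXML($xml);    // builds the entire tree in memory

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```

This is fine for small files, but memory use grows with the document size, so it is only a reference point for what the streaming solution should report.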

[EDIT] No, I'm not asking for 3rd-party products; I just want to know what I should do with PHP built-in functions: some sort of magic option in XMLReader / the XML expat parser, or example code that makes DOMDocument parse partial data from a streaming source. Or at least just tell me "you can't do this in PHP".

I've already checked many 3rd-party libraries but none of them can do what I want. They either just provide a wrapper around the XML expat parser, or rely on DOMDocument to load everything into memory at the beginning.

=====

BTW, is there any reliable way to get the line number from XMLReader? Yes, I know the XMLReader::expand() trick, but it just doesn't work when the XML is badly formed (such as a missing closing tag).
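For reference, the expand() trick mentioned above looks roughly like this on well-formed input. The function name `elementLines` is made up for this sketch. Note that expand() pulls the current node's whole subtree into memory, so calling it on the root element defeats the purpose of streaming.

```php
<?php
// Hypothetical helper: list [element name, line number] pairs using
// XMLReader::expand() and DOMNode::getLineNo(). Assumes well-formed XML.
function elementLines(string $xml): array
{
    $lines = [];
    $reader = new XMLReader();
    $reader->XML($xml);

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        // expand() copies the current node (and its subtree) into a DOMNode.
        $node = $reader->expand();
        if ($node === false) {
            break; // expansion failed, e.g. because of malformed input
        }
        $lines[] = [$reader->name, $node->getLineNo()];
    }
    $reader->close();

    return $lines;
}
```

A real linter would probably call expand() only on small leaf elements to keep memory bounded.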

Trying to count the number of \n and \r myself doesn't work either, because XMLReader doesn't report anything before <FacilityList>: the <?xml version="1.0"?> declaration and the following whitespace are totally ignored.


Solution

  • OK, judging from the comments from other people, the answer to my question seems to be "NO, YOU CAN'T DO THAT IN PHP".