
XML::LibXML::Reader: need to warn on schema errors instead of exiting


I need to use the Schema option of the Perl module XML::LibXML::Reader to validate a large (>1GB) XML file while the file is being parsed.

Previously I used the xmllint command to validate an XML file against a given schema (xsd) file. However, I now have some large XML files to validate, and I run out of memory (8GB) when trying to perform the validation.

I have read on the XML::LibXML::Reader Perl module page that there is a Schema option. However, when I use it (see code below), the code exits when the first invalid element of the XML file is found.

use strict;
use warnings;
use XML::LibXML::Reader;

my $SchemaFile='schema.xsd';
my $FileToAnalyse='/tmp/file.xml';

my $reader = XML::LibXML::Reader->new(location => $FileToAnalyse, Schema => $SchemaFile)
    or die "cannot read file '$FileToAnalyse': $!\n";

while ($reader->read) {
    # Process the file node by node here, even if it is not valid against
    # the schema (reduces memory usage for large files)
}

I need to collect the invalid entries and continue rather than exiting. Is this possible?


Solution

  • The reason $reader->read does not recover from schema validation errors (even where recovery would be possible) can be seen at line #8815 of LibXML.xs. Notice that REPORT_ERROR() is called with a zero value. The value indicates whether LibXML_report_error_ctx() will try to recover from errors. A value of zero means it will not try to recover and will instead call XML::LibXML::Error::_report_error to die.

    I changed the value to 1 at line #8815 and recompiled the XS module. It then reported the schema errors as warnings (instead of dying) and continued parsing.

    I guess there is a good reason why this option is not made available to the user, but I am not familiar enough with XML parsing to give an example of what could go wrong here.

    Edit:

    It seems that the correct approach is to catch the exception thrown by read() and then call read() again. If that following call returns -1, the parser was not able to recover from the error; if it returns 0, end-of-file was reached; and if it returns 1, it was able to recover from the exception. I did some testing, and it seems to be able to recover from schema validation errors but not from parsing errors. So you could try the following:

    use feature qw(say);
    use strict;
    use warnings;
    
    use Try::Tiny qw(try catch);
    use XML::LibXML::Reader;
    
    my $SchemaFile='schema.xsd';
    my $FileToAnalyse='file.xml';
    my $reader = XML::LibXML::Reader->new(
        location => $FileToAnalyse, Schema => $SchemaFile
    ) or die "cannot read file '$FileToAnalyse': $!\n";
    while (1) {
        my $result;
        try { $result = $reader->read } catch {
            say '==> ' . $_;
            $result = 1;  # Try to continue after exception..
        };
        last if $result != 1;
        if ( $reader->nodeType == XML_READER_TYPE_ELEMENT ) {
            say "Element node: ", $reader->name;
        }
    }
    $reader->finish();
    $reader->close();
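
Since the question asks to collect the invalid entries rather than only print them, the same loop could be wrapped in a small helper that stores each exception. This is only a sketch under assumptions: collect_schema_errors and its $max_errors cap are hypothetical names introduced here, not part of XML::LibXML. The cap is there because, as noted above, the reader does not seem to recover from parsing errors, so the loop should not be allowed to retry forever.

```perl
use strict;
use warnings;

use Try::Tiny qw(try catch);
use XML::LibXML::Reader;

# Hypothetical helper (not part of XML::LibXML): drive an already-constructed
# reader to the end of the input and return the collected error messages.
sub collect_schema_errors {
    my ($reader, $max_errors) = @_;
    $max_errors //= 1000;    # assumed safety cap, not a library setting

    my @errors;
    while (1) {
        my $result;
        try {
            $result = $reader->read;
        }
        catch {
            # Keep the stringified exception and ask the loop to call
            # read() again, as in the answer's loop above.
            push @errors, "$_";
            $result = 1;
        };
        last if $result != 1;              # 0 = end of file, -1 = could not recover
        last if @errors >= $max_errors;    # give up if errors keep piling up
    }
    return \@errors;
}

# Possible usage, with the file names from the question:
#
#   my $reader = XML::LibXML::Reader->new(
#       location => 'file.xml', Schema => 'schema.xsd'
#   ) or die "cannot read file\n";
#   my $errors = collect_schema_errors($reader);
#   print "schema error: $_" for @$errors;
#   $reader->close;
```

Passing the reader in, instead of the file names, keeps the helper independent of how the reader was constructed. Any per-node processing would go inside the loop after the first "last" check, as in the code above.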