Is it possible to pause and resume parsing using a handler class for XML::SAX::Expat?
The file is very large, and we are reading nodes into memory to render a table. We want to render only one section at a time or we run out of memory. So we need to stop parsing the file, do some things in other parts of the program, then resume on the next page.
I can think of a few ways to achieve this (below) but they all feel like hacks. Is there something native I can use?
Possible options:
The first two are inefficient and the last is messy. Are there better options?
Edit: explaining more about the file structure and why the alternatives don't work.
Apart from some other data the bulk of the structure is as below.
<DETAILS>
<DETAIL>
<ITEM1>...</ITEM1>
<ITEM2>...</ITEM2>
...
</DETAIL>
<DETAIL>
<ITEM1>...</ITEM1>
<ITEM2>...</ITEM2>
...
</DETAIL>
...
</DETAILS>
For the file in question each <DETAIL> node is roughly 240 bytes in size, which isn't much, but we have over 180,000 of them (this is one of the smaller files that fails to process). LibXML fails when it hits this structure since it attempts to parse it all into memory (we are limited to a 32-bit system and there are other significant structures in Perl's memory).
After updating to the latest version and some code tweaks, XML::Twig will parse the document, but I still have the same issue - is it possible to pause and resume later?
I don't control the entire logic flow, so when the main application is ready for the next page it calls my object to get it. I need to be able to output a chunk of data and wait for the next request. This could probably be handled by a fork, but I'm not sure that should be required.
An example showing the program flow is below. This is a simplification (especially the while loop). The real program has a complex nested structure of document pages, each containing multiple objects representing page elements. The structure is defined via a web service call and is data driven, so we cannot hard-code any assumptions.
I can't see how to fit a callback into this - processing must resume after the table to emit the remaining page elements, start a new page, and emit the first few page elements of that new page before resuming the table.
use strict;
use warnings;
use XML::Twig;

my $table = Table->new('details.xml');
my $table_finished = 0;
while (!$table_finished) {
    # emit some data e.g. page header
    # ...

    # emit the table - 2 data rows per page, for testing
    $table_finished = $table->partial_emit(2);

    # emit some data e.g. page footer
    # ...
}
exit;
exit;
package Table;

sub new {
    my ($class, $filename) = @_;
    my $self = {
        '_file' => $filename,
    };
    bless($self, $class);
    my $sub_ref = $self->can('process_table_row');
    $self->{'_twig'} = XML::Twig->new(
        twig_handlers => {
            'DETAIL' => sub {
                $sub_ref->($self, @_);
            },
        },
    );
    return $self;
}
sub partial_emit {
    my ($this, $rows) = @_;
    $this->{'_rows_emitted'} = 0;
    $this->{'_limit'}        = $rows;
    $this->{'_finished'}     = 1;

    # we want this to return after parsing part of the file if it is large
    $this->{'_twig'}->parsefile($this->{'_file'});

    # should be zero if we returned early
    return $this->{'_finished'};
}
sub process_table_row {
    my ($this, $twig, $elt) = @_;

    # increase row count
    $this->{'_rows_emitted'}++;

    # handle data - doesn't matter what it does here
    print $elt->text, "\n";

    # we've done as many as we want - how to stop processing and return to the main loop?
    if ($this->{'_rows_emitted'} >= $this->{'_limit'}) {
        print "Limit reached\n";
        # Ideally we'd set this, tell Twig to stop for a while, and carry on,
        # but in my test script this causes an infinite loop
        #$this->{'_finished'} = 0;
    }
}
1;
And another edit... it seems after tweaking my search I stumbled across what I wanted this whole time. XML::SAX::Expat::Incremental has a parse_more routine that does exactly what I need. I'll need to wait a few days to test on the full data set, but a brief test as below works.
The Table class can do this:

$self->{'_parser'} = XML::SAX::Expat::Incremental->new( Handler => MyHandler->new($self) );

where MyHandler is a simple XML::SAX style handler which now has access to the Table.
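The handler itself isn't shown above; a minimal sketch follows. The field names (`_table`, `_text`) and the DETAIL-counting logic are assumptions based on the file structure described earlier, not the original code.

```perl
package MyHandler;
use strict;
use warnings;

sub new {
    my ($class, $table) = @_;
    # keep a reference back to the Table so the handler can update its row count
    return bless { '_table' => $table, '_text' => '' }, $class;
}

# SAX2 events: element hashes carry a Name key, character data carries a Data key
sub start_element {
    my ($self, $el) = @_;
    $self->{'_text'} = '' if $el->{Name} eq 'DETAIL';    # reset buffer per row
}

sub characters {
    my ($self, $data) = @_;
    $self->{'_text'} .= $data->{Data};    # may fire several times per text node
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'DETAIL') {
        # a full row has been parsed - count it so partial_emit can stop
        $self->{'_table'}{'_rows_emitted'}++;
    }
}

1;
```

Because the handler holds the Table, partial_emit can check `_rows_emitted` after each parse_more call and break out of its read loop.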
A call to Table::partial_emit will do something like this:
my $buf;
my $bytes_to_read = 50;    # small for testing
while (read($this->{'_fh'}, $buf, $bytes_to_read)) {
    $this->{'_parser'}->parse_more($buf);

    # MyHandler will increment this based on the number of rows (DETAIL nodes) encountered
    if ($this->{'_rows_emitted'} >= $rows) {
        $this->{'_finished'} = 0;
        last;
    }
}
The above probably has some bugs in edge cases, but it works fine for my test. I will need to stress test it properly later and see if it's production ready.
After some searching I came across a very useful old thread which details exactly what I need.
http://www.perlmonks.org/?node_id=420383
I can use XML::Parser::ExpatNB for the behaviour I need. XML::SAX::Expat::Incremental will wrap this up into a SAX interface if necessary, but I don't think I'll bother.
Sample code is below. It performs well enough (faster than XML::Twig) so I'll be using this.
use strict;
use warnings;
use XML::Parser::Expat;    # XML::Parser::ExpatNB is defined in this module

my $parser = XML::Parser::ExpatNB->new();
$parser->setHandlers(
    'Start' => \&start_element,
    'End'   => \&end_element,
    'Char'  => \&char_data,
);

my $read_size = 64 * 1024;    # test to find optimal size
my $file_name = '../details.xml';
my $buf;

open(my $fh, '<', $file_name) or die $!;
binmode($fh);

my $bytes_read;
while ( $bytes_read = read($fh, $buf, $read_size) ) {
    $parser->parse_more($buf);
}
# check for a read error before declaring the document finished
die "Read error: $!" unless defined($bytes_read);
$parser->parse_done();
close($fh);
I have omitted the handlers; they are typical for this kind of approach ($_[0] is the XML::Parser::ExpatNB object, which includes the current context, and $_[1] is the data, e.g. node name or character data).
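For reference, handlers of that shape might look like the sketch below. The DETAIL buffering is an assumption based on the file structure above; the real handlers render table rows.

```perl
use strict;
use warnings;

our $text = '';    # character data buffer for the current row

# $_[0] is the ExpatNB object; remaining arguments depend on the event
sub start_element {
    my ($expat, $name, %attrs) = @_;
    $text = '' if $name eq 'DETAIL';    # reset the buffer at the start of a row
}

sub char_data {
    my ($expat, $data) = @_;
    $text .= $data;    # expat may deliver one text node in several pieces
}

sub end_element {
    my ($expat, $name) = @_;
    print "$text\n" if $name eq 'DETAIL';    # a complete row has been parsed
}
```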
XML::LibXML::Reader also works, as shown below; I didn't entirely understand the interface earlier. It is slower on my machine, though, and the node handling required is a bit more complex (e.g. CDATA is not automatically returned as text), so I will avoid it for now.
use XML::LibXML::Reader;

my $reader = XML::LibXML::Reader->new(location => $file_name) or die $!;
while ($reader->read) {
    processNode($reader);
}
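The processNode routine is not shown above; a minimal version for the DETAIL structure might look like the sketch below (the row-extraction logic is an assumption; the XML_READER_TYPE_* constants are exported by XML::LibXML::Reader).

```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* node type constants

# Return the text of a DETAIL row, or undef for any other node.
sub processNode {
    my ($reader) = @_;
    if (   $reader->nodeType == XML_READER_TYPE_ELEMENT
        && $reader->name eq 'DETAIL') {
        # deep-copy the current node and take its full text content;
        # going via a DOM node also resolves CDATA sections into plain text
        return $reader->copyCurrentNode(1)->textContent;
    }
    return;
}
```

The caller's read loop can then print (or page) whatever processNode returns.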