I'm working on an XML file with some mixed content (elements containing text, then one child tag, then text again).
I would like to extract, for each parent element, the word (substring) coming right before the child element.
<root>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
</root>
Expected output:
all
all
all
all
I know that applying text_only to the parent element will give me "there is text all around it", so I don't have to deal with the child element anymore, but then I don't know how to locate the preceding word.
Should I replace the child element with some kind of textual marker like | and just go through the remaining text as a single string?
I'm not asking for a full "ready-made" answer, but some directions would sure be helpful.
You can find each child element and then check the text of the sibling to its left, which is its previous sibling. Conveniently, there is a method prev_sibling_text that gives you just that, since the previous sibling is a text node anyway. From there, it's just a matter of locating the last word.
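use strict;
use warnings;
use feature 'say';
use XML::Twig;

my $twig = XML::Twig->new(
    TwigHandlers => {
        # Called for each <child> element once it has been fully parsed.
        child => sub {
            # Take the text just before <child>, split it on runs of
            # whitespace, and print the last word.
            say +( split ' ', $_->prev_sibling_text )[-1];
        },
    },
);

# Parse the XML from the __DATA__ section below.
$twig->parse( \*DATA );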
__DATA__
<root>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
</root>
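If the text before a child might be empty or contain only whitespace, the list slice above yields undef and triggers a warning. Here is a minimal sketch of a more defensive way to locate the last word. The helper name last_word is hypothetical (it is not part of XML::Twig), and it assumes a "word" is simply a run of non-whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Twig): return the last
# whitespace-delimited word of a string, or undef if there is none.
sub last_word {
    my ($text) = @_;

    # Capture the final run of non-whitespace characters, ignoring any
    # trailing whitespace. An undefined input is treated as empty.
    my ($word) = ( $text // '' ) =~ /(\S+)\s*\z/;

    return $word;
}

print last_word(' there is text all '), "\n";
print defined last_word('   ') ? "word\n" : "no word\n";
```

Inside the handler you could then call last_word( $_->prev_sibling_text ) and skip the element when the result is undefined.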