Perl XML::Twig: Find a substring located before a child element in mixed content


I'm working on an XML file with some mixed content (elements containing text, one child tag, then text again).
I would like to extract, for each parent element, the word (substring) that comes right before the child element.

Example XML input:

<root>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
</root>

Example text output:

all
all
all
all

I know that applying text_only to the parent element will give me "there is text all around it", so I don't have to deal with the child element anymore, but then I don't know how to locate the preceding word.

Should I replace the child element with some kind of textual marker, like |, and then go through the remaining text as a single string?

I'm not asking for a full "ready-made" answer, but some directions would sure be helpful.


Solution

  • You can set a handler on each child element and look at the text of its previous sibling, that is, the node immediately to its left. Conveniently, XML::Twig has a prev_sibling_text method that gives you just that text; here the previous sibling is always a text node. From there, it's just a matter of picking out the last word.

    use strict;
    use warnings;
    use feature 'say';
    use XML::Twig;
    
    my $twig = XML::Twig->new(
        TwigHandlers => {
            child => sub {
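                # $_ is the current <child> element. Split the text node
                # before it on whitespace and print the last word.
                say +( split ' ', $_->prev_sibling_text )[-1];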
            },
        }
    );
    
    $twig->parse( \*DATA );
    
    __DATA__
    <root>
    <parent> there is text all <child>text</child> around it</parent>
    <parent> there is text all <child>text</child> around it</parent>
    <parent> there is text all <child>text</child> around it</parent>
    <parent> there is text all <child>text</child> around it</parent>
    </root>
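
The "locate the last word" step can also be done with a regular expression instead of split. The following is only a minimal sketch in core Perl: last_word is a hypothetical helper name, not part of XML::Twig, and it assumes that a "word" is any run of non-whitespace characters.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (not an XML::Twig method): return the last
# whitespace-delimited word of a string, or undef if there is none.
sub last_word {
    my ($text) = @_;
    return unless defined $text;

    # Capture the final run of non-whitespace characters, allowing
    # optional trailing whitespace before the end of the string.
    my ($word) = $text =~ /(\S+)\s*\z/;
    return $word;
}

say last_word(' there is text all ');
say last_word("text\n  all\t");
say defined last_word('   ') ? 'word found' : 'no word';
```

Inside the child handler, this helper could be called as last_word( $_->prev_sibling_text ). Because it returns undef when there is no word, the caller can check the result before printing it.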