
Symfony2 - DomCrawler - fetch an element's content by matching its neighbour's content against a regex


I have this XML:

<Item id="3" idLevel="3">
    <Label qualifier="Usual">
        <LabelText language="ALL">BE01</LabelText>
    </Label>
    <Label qualifier="Usual">
        <LabelText language="EN">R&#xc9;GION DE BRUXELLES-CAPITALE / BRUSSELS HOOFDSTEDELIJK GEWEST</LabelText>
    </Label>
</Item>
<Item id="4" idLevel="3">
    <Label qualifier="Usual">
        <LabelText language="ALL">BE001</LabelText>
    </Label>
    <Label qualifier="Usual">
        <LabelText language="EN">VLAAMS GEWEST</LabelText>
    </Label>
</Item>
<Item id="123" idLevel="3">
    <Label qualifier="Usual">
        <LabelText language="ALL">RO001</LabelText>
    </Label>
    <Label qualifier="Usual">
        <LabelText language="EN">MACROREGIUNEA DOI</LabelText>
    </Label>
</Item>

I would like to fetch the value of a <LabelText language="EN"> whose neighbouring <LabelText language="ALL"> starts with "BE" followed by exactly three digits.

For the example above, that is the value from the second Item: VLAAMS GEWEST

I know how to approach it in an ugly way, but I believe there should be a more flexible and elegant way to do it:

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addXmlContent($xml);
$crawler = $crawler->filterXPath('//Item[@idLevel="3"]');

foreach ($crawler as $domElement) {
    // here I would use a regex to check whether the neighbouring element's value is "BE" followed by three digits
}

Is there a way to handle it with DomCrawler instead of iterating all elements and checking each?


Solution

  • You can use a single XPath expression that selects just the text you need:

    //Item[@idLevel="3"]/Label[string-length(preceding-sibling::Label/LabelText/text()) = 5 and starts-with(preceding-sibling::Label/LabelText/text(), "BE") and number(substring(preceding-sibling::Label/LabelText/text(), 3)) = number(substring(preceding-sibling::Label/LabelText/text(), 3))]/LabelText[@language="EN"]/text()
    

    Breaking it down:

    • //Item[@idLevel="3"] - selects the Item nodes whose idLevel attribute has the value 3
    • /Label - then their Label children that have...
    • [string-length(preceding-sibling::Label/LabelText/text()) = 5 - a preceding sibling Label whose LabelText text is exactly 5 characters long...
    • and starts-with(preceding-sibling::Label/LabelText/text(), "BE") - and starts with BE...
    • and number(substring(preceding-sibling::Label/LabelText/text(), 3)) = number(substring(preceding-sibling::Label/LabelText/text(), 3))] - and whose characters from position 3 onwards parse as a number. number() returns NaN for non-numeric text, and NaN is never equal to itself, so the comparison is true only for numeric text. This is slightly looser than "three digits": values such as 1.5 or -12 also parse as numbers
    • /LabelText[@language="EN"]/text() - finally, takes the text of the LabelText child whose language attribute has the value EN
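
To try the idea end to end, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath so it runs without Composer; the same XPath 1.0 string should also be usable with DomCrawler's filterXPath(). It makes a few assumptions that are not in the original: the sample items are wrapped in a hypothetical <Items> root so the fragment parses as one document, the neighbour is narrowed with [@language="ALL"], the expression selects the LabelText element rather than its text() node, and the numeric check is swapped for a stricter digits-only test based on translate().

```php
<?php
// Sample data from the question, wrapped in a hypothetical <Items> root
// (an assumption) so that it forms a single well-formed XML document.
$xml = <<<'XML'
<Items>
  <Item id="3" idLevel="3">
    <Label qualifier="Usual"><LabelText language="ALL">BE01</LabelText></Label>
    <Label qualifier="Usual"><LabelText language="EN">R&#xc9;GION DE BRUXELLES-CAPITALE / BRUSSELS HOOFDSTEDELIJK GEWEST</LabelText></Label>
  </Item>
  <Item id="4" idLevel="3">
    <Label qualifier="Usual"><LabelText language="ALL">BE001</LabelText></Label>
    <Label qualifier="Usual"><LabelText language="EN">VLAAMS GEWEST</LabelText></Label>
  </Item>
  <Item id="123" idLevel="3">
    <Label qualifier="Usual"><LabelText language="ALL">RO001</LabelText></Label>
    <Label qualifier="Usual"><LabelText language="EN">MACROREGIUNEA DOI</LabelText></Label>
  </Item>
</Items>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// The code held by the preceding Label, e.g. "BE001".
$code = 'preceding-sibling::Label/LabelText[@language="ALL"]';

// Length 5, starts with "BE", and the last three characters are digits only:
// translate() deletes every digit, so an empty result means nothing else was there.
$expr = sprintf(
    '//Item[@idLevel="3"]/Label['
    . 'string-length(%1$s) = 5'
    . ' and starts-with(%1$s, "BE")'
    . ' and translate(substring(%1$s, 3), "0123456789", "") = ""'
    . ']/LabelText[@language="EN"]',
    $code
);

// Collect the text of every matching LabelText element.
$matches = [];
foreach ($xpath->query($expr) as $node) {
    $matches[] = $node->textContent;
}

print_r($matches);
```

For the sample data this leaves only VLAAMS GEWEST in $matches: BE01 fails the length test and RO001 fails the prefix test. Unlike the number() comparison, the translate() test also rejects codes such as BE1.5 or BE-12.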