Tags: html, regex, xml, xpath, xidel

Why is my XPath with regex failing to match?


I would like to use Xidel to select a <section> tag with class="body" if it contains a date in the format YYYY.M(M).D(D), and then extract one specific string from it. That string is 8 characters long and can contain characters and digits.

Sample input HTML:

<section class="body">
Start 2019.1.12

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

thi1te_t

</section>

Command:

xidel -s input.html -e "//*[@class='body' and contains(text(),'(20\d{2}).(\d{1,2}).(\d{1,2})')]"

For some reason I can't get this regex to work. On regex101.com it works fine.

I would like to get thi1te_t in the final output, probably with regex ^.{8}$ and grep.


Solution

  • Use matches(), which matches against a regex, rather than contains(), which tests for literal substring containment.

    I'd also suggest using . rather than text(), since what you really want to match is the string value of the element, not a text() node child.


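    Altogether, the XPath for selecting the targeted element would be as follows (the dots are escaped as \. so that they match only a literal period rather than any character):

    //*[@class='body' and matches(., '(20\d{2})\.(\d{1,2})\.(\d{1,2})')]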
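
The difference between a literal substring test and a regex test can be sketched outside XPath. The snippet below is only an illustration: Python's `in` operator and `re.search` stand in for XPath's contains() and matches(), the sample text is abbreviated from the question, and escaping the dots assumes the date separators are meant to be literal periods.

```python
import re

# Abbreviated stand-in for the string value of the sample <section>.
text = "Start 2019.1.12\n\nLorem Ipsum ...\n\nthi1te_t\n"

# Date pattern with escaped dots, so "." matches only a literal period.
pattern = r"20\d{2}\.\d{1,2}\.\d{1,2}"

# Literal substring test, analogous to contains(): the pattern text itself
# never appears in the input, so this is False.
literal_hit = pattern in text

# Regex test, analogous to matches(): the pattern may match anywhere.
regex_hit = re.search(pattern, text) is not None

# With unescaped dots, any character is accepted as a separator.
loose_hit = re.search(r"20\d{2}.\d{1,2}.\d{1,2}", "2019x1x12") is not None

print(literal_hit, regex_hit, loose_hit)
```

This is why the original command selected nothing: contains() looked for the regex source text itself inside the element.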
    

    "I would like to get thi1te_t in the final output, probably with regex ^.{8}$ and grep."

    You can return that substring by tokenizing the string value of the element matched by the above XPath and then selecting the line that matches your target regex:

    tokenize(//*[@class='body' and matches(., '(20\d{2})\.(\d{1,2})\.(\d{1,2})')],
            '\s*\n\s*')[matches(., '^.{8}$')]
    

    This XPath expression returns thi1te_t, as requested.
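
The tokenize-then-filter step can be traced with a small sketch. It is an approximation rather than what Xidel does internally: Python's `re.split` and `re.fullmatch` stand in for tokenize() and matches(., '^.{8}$'), and `section_text` is a hypothetical, shortened copy of the sample element's string value.

```python
import re

# Hypothetical, shortened string value of the sample <section> element.
section_text = """
Start 2019.1.12

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

thi1te_t

"""

# Split on newlines together with any surrounding whitespace,
# mirroring tokenize(., '\s*\n\s*').
lines = re.split(r"\s*\n\s*", section_text)

# Keep only the pieces that are exactly 8 characters long,
# mirroring the predicate [matches(., '^.{8}$')].
hits = [line for line in lines if re.fullmatch(r".{8}", line)]

print(hits)
```

Because the filter only checks length, any other 8-character line in the section would be returned as well. A stricter pattern would be needed if that can happen in real input.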