
Match elements containing both a specific case-insensitive whole word and any number of any digits


Example HTML:

<root>
<td><p><b>Random Text</b></p>
<p><b>Random Text:</b> Random Text</p>
<p><b>Random Text:</b> 001057567</p>
<p><b>Random Text:</b> Random Text</p>
<p><b>EXAMPLE:</b> 00887546858</p>
</td>
</root>

I need to match the <p> element that contains "EXAMPLE" and a random number, but I need "EXAMPLE" to be case-insensitive and a whole word only (it must either be the first word in a string or be both preceded and followed by a space or any punctuation mark).

It must be an XPath 1.0 query because the environment I'm working in doesn't support newer XPath versions.

Right now, I have this query:

//*
   [contains(., 'EXAMPLE') and translate(., translate(., '0123456789', ''), '') != '']
   [not(
      *[contains(., 'EXAMPLE') and translate(., translate(., '0123456789', ''), '') != '']
   )]

It only matches elements that contain EXAMPLE in upper case, and it matches them whether or not EXAMPLE is a whole word.

I need to be able to match such cases too:

<root>
<td><p><b>Random Text</b></p>
<p><b>Random Text:</b> Random Text</p>
<p><b>Random Text:</b> 001057567</p>
<p><b>Random Text:</b> Random Text</p>
<p><b>for eXaMpLe:</b> 00887546858</p>
</td>
</root>

or

<root>
<td><p><b>Random Text</b></p>
<p><b>Random Text:</b> Random Text</p>
<p><b>Random Text:</b> 001057567</p>
<p><b>Random Text:</b> Random Text</p>
<p>test,eXaMpLe:00887546858</p>
</td>
</root>

But at the same time, I need to skip such cases:

<root>
<td><p><b>Random Text</b></p>
<p><b>Random Text:</b> Random Text</p>
<p><b>Random Text:</b> 001057567</p>
<p><b>Random Text:</b> Random Text</p>
<p><b>534534tretetEXAMPLE:</b> 00887546858</p>
</td>
</root>

or

<root>
<td><p><b>Random Text</b></p>
<p><b>Random Text:</b> Random Text</p>
<p><b>Random Text:</b> 001057567</p>
<p><b>Random Text:</b> Random Text</p>
<p><b>EXAMPLE00887546858</b></p>
</td>
</root>

I asked ChatGPT about the solution numerous times, but it keeps providing incorrect answers that either don't match anything on the page or match the whole body.


Solution

  • There are several requirements here which are easy to achieve in XPath 2.0 or later, but much harder in XPath 1.0 which lacks features like:

    • conversion between upper case and lower case
    • matching regular expressions
    • string tokenizing
    • user-defined functions and variables

    This means that an XPath 1.0 expression is going to be quite wordy and repetitive, but it's easy enough to build up if you go step by step.

    Case insensitivity

    XPath 1.0 has no notion of "upper case" and "lower case", so you have to do this conversion using the translate function. The first parameter to this function is the string you want to convert, the second specifies a list of characters you want to replace, and the third specifies a matching list of characters you want to replace them with. For example:

    translate('eXaMpLe', 'example', 'EXAMPLE') = 'EXAMPLE'
    

    With this function you can convert the relevant letters of an element's textual content to upper case (only the letters that occur in the search word need converting), and then compare that string to the upper case version of the string you want. For example, this expression searches for an element which, once those letters are translated to upper case, contains the word 'EXAMPLE':

    //*[contains(translate(., 'example', 'EXAMPLE'), 'EXAMPLE')]
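The behaviour of translate can be checked outside an XPath engine. The following is a minimal Python sketch, assuming the XPath 1.0 rules for translate: the first occurrence of a character in the second argument decides its replacement, and a character with no counterpart in the third argument is removed. The names xpath_translate, contains_ci, LOWER and UPPER are illustrative only and not part of any library.

```python
LOWER = "abcdefghijklmnopqrstuvwxyz"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def xpath_translate(s, src, dst):
    """Emulate XPath 1.0 translate(s, src, dst) for ordinary strings."""
    out = []
    for ch in s:
        i = src.find(ch)          # first occurrence in src wins
        if i == -1:
            out.append(ch)        # not listed in src: keep unchanged
        elif i < len(dst):
            out.append(dst[i])    # replace with the character at the same position
        # listed in src but beyond the end of dst: the character is dropped
    return "".join(out)


def contains_ci(text, word):
    """Case-insensitive substring test for ASCII letters, like contains(translate(...), ...)."""
    return xpath_translate(word, LOWER, UPPER) in xpath_translate(text, LOWER, UPPER)


print(xpath_translate("eXaMpLe", "example", "EXAMPLE"))
print(contains_ci("for eXaMpLe: 00887546858", "example"))
```

Mapping the whole alphabet, as contains_ci does, corresponds to passing 'abcdefghijklmnopqrstuvwxyz' and 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' to translate in XPath. Mapping only the letters of 'example' is enough for this question.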
    

    Whole word matching

    XPath 1.0 has no way of parsing strings into a set of tokens separated by some set of separator characters, or even any way to represent a sequence of such tokens (i.e. there is no such thing as a sequence of strings, as there is in XPath 2.0).

    The best you can do here is to normalize the string you want to search in, so that the various separator characters are all replaced with a common separator (a space, for example), and then search in that string for the string you want, with spaces around it. For example:

    translate('blah.blah,blah;blah', ':,;.', '    ') = 'blah blah blah blah'
    

    ... and then you can search such a normalized string value for the substring ' EXAMPLE ' (note the spaces around the word EXAMPLE which ensure that you find only the whole word, and won't find e.g. 123EXAMPLE)

    //*[contains(concat(' ', translate(., ':,;.', '    '), ' '), ' EXAMPLE ')]
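To see which strings this test accepts and rejects, here is a small Python sketch of the same normalize-then-search idea. It assumes the XPath 1.0 semantics of translate and uses the same four separator characters as the expression above. The helper names xpath_translate and contains_whole_word are hypothetical.

```python
def xpath_translate(s, src, dst):
    """Emulate XPath 1.0 translate(s, src, dst) for ordinary strings."""
    out = []
    for ch in s:
        i = src.find(ch)
        if i == -1:
            out.append(ch)
        elif i < len(dst):
            out.append(dst[i])
    return "".join(out)


def contains_whole_word(text, word="EXAMPLE", separators=":,;."):
    """Mirror contains(concat(' ', translate(., seps, spaces), ' '), ' WORD ')."""
    normalized = xpath_translate(text, separators, " " * len(separators))
    return f" {word} " in f" {normalized} "


for sample in [
    "EXAMPLE: 00887546858",
    "test,EXAMPLE:00887546858",
    "534534tretetEXAMPLE: 00887546858",
    "EXAMPLE00887546858",
]:
    print(sample, contains_whole_word(sample))
```

Only characters in the separator list count as word boundaries. Other punctuation such as '!', or whitespace such as a newline or tab, would have to be added to both translate arguments to be treated the same way.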
    

    Checking for digits

    You also need to check for the presence of digits. The simplest way, I think, is to use the translate function to replace any digits with the empty string, and check whether the translated string is different from the original.

    . != translate(., '0123456789', '')
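This test and the double translate used in the question's query should agree on whether a string contains a digit. The Python sketch below compares the two, assuming the XPath 1.0 semantics of translate. The function names are made up for illustration.

```python
DIGITS = "0123456789"


def xpath_translate(s, src, dst):
    """Emulate XPath 1.0 translate(s, src, dst) for ordinary strings."""
    out = []
    for ch in s:
        i = src.find(ch)
        if i == -1:
            out.append(ch)
        elif i < len(dst):
            out.append(dst[i])
    return "".join(out)


def has_digit_compare(text):
    """Answer's form: . != translate(., '0123456789', '')"""
    return text != xpath_translate(text, DIGITS, "")


def has_digit_extract(text):
    """Question's form: translate(., translate(., '0123456789', ''), '') != ''"""
    non_digits = xpath_translate(text, DIGITS, "")
    # Removing every non-digit character leaves only the digits.
    return xpath_translate(text, non_digits, "") != ""


for sample in ["EXAMPLE: 00887546858", "Random Text", ""]:
    print(repr(sample), has_digit_compare(sample), has_digit_extract(sample))
```

The comparison form calls translate once instead of twice, which keeps the final expression a little shorter.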
    

    Whole-word, case-insensitive search

    Putting all these together, you get:

    //*[
       contains(
          concat(
             ' ', 
             translate(
                translate(., 'example', 'EXAMPLE'), 
                ':,;.', 
                '    '
             ),
             ' '
          ), 
          ' EXAMPLE '
       ) and
       . != translate(., '0123456789', '')
    ]
    

    Finally, you need to exclude elements from the results if they have child elements which would also be in the results. Otherwise you'll end up returning not just an element x that matches the test, but also x's parent element, and all ancestors right up to the root element. To exclude those elements, you need to take the entire expression above and use it as a filter to exclude elements if they have a child element that matches the filter.

    //*[
      contains(
        concat(
          ' ',
          translate(
            translate(., 'example', 'EXAMPLE'), 
            ':,;.', 
            '    '
          ),
          ' '
        ), 
        ' EXAMPLE '
      ) and
      . != translate(., '0123456789', '')
    ]
    [not(
      *[
        contains(
          concat(
            ' ',
            translate(
              translate(., 'example', 'EXAMPLE'), 
              ':,;.', 
              '    '
            ),
            ' '
          ), 
          ' EXAMPLE '
        ) and
        . != translate(., '0123456789', '')
      ]
    )]
    

    It's quite a monstrous expression, in the end.

    If you have any chance to upgrade your XPath engine, you could have an XPath 3.1 expression about 5% - 10% as long as this, and it would be easier to understand and maintain.
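The logic of the final XPath 1.0 expression can be sanity-checked against the question's sample documents without an XPath engine. The sketch below mirrors both predicates in plain Python using only the standard library. It assumes the XPath 1.0 semantics of translate and treats an element's string value as the concatenation of all its descendant text. All function names, and the SAMPLE constant, are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<root>
<td><p><b>Random Text</b></p>
<p><b>Random Text:</b> 001057567</p>
<p><b>for eXaMpLe:</b> 00887546858</p>
</td>
</root>"""


def xpath_translate(s, src, dst):
    """Emulate XPath 1.0 translate(s, src, dst) for ordinary strings."""
    out = []
    for ch in s:
        i = src.find(ch)
        if i == -1:
            out.append(ch)
        elif i < len(dst):
            out.append(dst[i])
    return "".join(out)


def matches(elem):
    """First predicate: whole-word, case-insensitive EXAMPLE plus at least one digit."""
    text = "".join(elem.itertext())  # string value of the element
    upper = xpath_translate(text, "example", "EXAMPLE")
    normalized = xpath_translate(upper, ":,;.", "    ")
    has_word = " EXAMPLE " in f" {normalized} "
    has_digit = text != xpath_translate(text, "0123456789", "")
    return has_word and has_digit


def innermost_matches(root):
    """Second predicate: keep a match only if none of its child elements also match."""
    return [
        e for e in root.iter()
        if matches(e) and not any(matches(child) for child in e)
    ]


for elem in innermost_matches(ET.fromstring(SAMPLE)):
    print(elem.tag, "".join(elem.itertext()))
```

For this sample, the root and td elements pass the first test but are dropped by the second, because each has a matching child. The b element fails the first test because it contains no digits, which leaves only the p element.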