python selenium-webdriver xpath

Python Selenium - Get sibling link depending on inner text of span


I've been working on this for hours, but just can't seem to put all the parts together... So given:

<a href="link1">link</a>
<span class="class_name">00A</span>
...
<a href="link2">link</a>
<span class="class_name">00B</span>
...
<a href="link3">link</a>
<span class="class_name">01B</span>
...
<a href="link4">link</a>
<span class="class_name">01A</span>

I'm trying to get the link depending on the inner text of span. So I know... I can get all the links with:

links = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(@class, 'class_name')]//preceding-sibling::a[@href]")))]

I can get the text on a single span with:

print(driver.find_element(By.XPATH, "//span[contains(@class, 'class_name')]").text)

But I can't use find_elements to get all of their text to test, since that ends up asking for the text of a list. I should be able to use:

[contains(text(), '\\d+[A]')]")

But I don't know how to combine it with the code for all the links. I feel like I'm overlooking something really stupid, but it's 6:30am and I started working on this project yesterday evening, so I give up and am just going to ask someone more intelligent. Thank you in advance for any help.


Solution

  • Note that the second parameter of the contains() function is not a regular expression; it's a plain string which is to be sought within the first string parameter. I believe with Selenium you are stuck with XPath 1.0 which does not have any regular expression functions.
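One way around the missing regular-expression support is to let XPath select every candidate span and apply the pattern in Python instead. The following is only a minimal sketch, assuming each label span directly follows its link as a sibling (as in the question's markup). The name hrefs_for_labels is a hypothetical helper, and the literal "xpath" is the string value of Selenium's By.XPATH, used here so the snippet has no import-time dependency.

```python
import re


def hrefs_for_labels(driver, pattern=r"\d+A"):
    """Return hrefs of links whose label span text fully matches `pattern`.

    `driver` is assumed to be a Selenium WebDriver; `pattern` is a Python regex.
    """
    label_re = re.compile(pattern)
    hrefs = []
    # "xpath" is the string value of selenium's By.XPATH.
    spans = driver.find_elements("xpath", "//span[contains(@class, 'class_name')]")
    for span in spans:
        # The regex test happens in Python, not in XPath.
        if label_re.fullmatch(span.text.strip()):
            # Relative XPath: the nearest preceding sibling <a> that has an href.
            link = span.find_element("xpath", "./preceding-sibling::a[@href][1]")
            hrefs.append(link.get_attribute("href"))
    return hrefs
```

This costs one extra round trip to the browser per matching span, which is usually acceptable for a handful of elements. In a real browser, get_attribute("href") typically returns the resolved absolute URL rather than the literal attribute text.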

    Without using a regular expression, if you wanted to filter a set of span elements to include only those whose text content consisted of a string of digits followed by a single A, you would need to use a more complicated expression which combines a bunch of string functions, e.g. something like:

    span[
       contains(., 'A') and
       contains('0123456789', substring(., 1, 1)) and 
       translate(substring-before(., 'A'), '0123456789', '') = '' and
       substring-after(., 'A') = ''
    ]
    

    NB the . is a reference to the "context node" which in the predicate expression means one of the span elements.

    This expression means:

    span elements

    • which contain an A character somewhere; and
    • whose first character is a digit; and
    • the text before the A consists entirely of digits; and
    • where there's no text after the A (i.e. there's just one A, at the end)
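To sanity-check what the four conditions accept, they can be mirrored one-for-one with Python string operations. This is only an illustrative sketch outside the browser; matches_digits_then_a is a hypothetical name, and str.partition stands in for substring-before / substring-after (both split on the first A).

```python
DIGITS = "0123456789"


def matches_digits_then_a(text):
    """Mirror the four XPath conditions: digits followed by a single trailing A."""
    # partition() splits on the FIRST "A", like substring-before/substring-after.
    before, sep, after = text.partition("A")
    only_digits_before = before.translate(str.maketrans("", "", DIGITS)) == ""
    return (
        sep == "A"                 # contains(., 'A')
        and text[:1] in DIGITS     # contains('0123456789', substring(., 1, 1))
        and only_digits_before     # translate(substring-before(., 'A'), digits, '') = ''
        and after == ""            # substring-after(., 'A') = ''
    )


labels = ["00A", "00B", "01B", "01A"]
accepted = [label for label in labels if matches_digits_then_a(label)]
```

For the four labels in the question, accepted ends up holding 00A and 01A. A bare "A" is rejected by the first-character test, and "0AA" by the nothing-after-A test.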

    BTW, I'm not sure this expression does what you think it does:

    //span[contains(@class, 'class_name')]//preceding-sibling::a[@href]
    

    To clarify: the // in XPath is an abbreviation for the expression /descendant-or-self::node()/. So your expression could be written as:

    //span[contains(@class, 'class_name')]
       /descendant-or-self::node()/preceding-sibling::a[@href]
    

    This will return every a element (with an href attribute) which is followed by a sibling node which is either:

    • a span element whose class attribute contains 'class_name'; or
    • a descendant of such a span element.

    If you know that the span and a are actually siblings then you can replace that // with the simpler / (the same applies to the // in my suggestion below).

    The other thing to note here is that unless each pair of span (or span descendant) and a is contained within its own parent element, the preceding-sibling::a[@href] step will return all the a elements that precede the span, not just the nearest one (which I suspect is what you want, since I take it that each span provides a label for the link immediately before it). You can apply the predicate [1] to the set of a[@href] elements to get just the first (in preceding-sibling order, i.e. the one closest to the span).
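The effect of [1] on the preceding-sibling axis can be modelled without a browser. The sketch below uses only Python's standard html.parser and treats the a and span elements as one flat run of siblings, which is an assumption about the real page. SiblingCollector and preceding_links are hypothetical names. This is a simplified model of the axis, not an XPath engine.

```python
from html.parser import HTMLParser


class SiblingCollector(HTMLParser):
    """Collect <a> and <span> elements in document order as flat siblings."""

    def __init__(self):
        super().__init__()
        self.nodes = []     # each entry: [tag, href, text]
        self._open = None   # the element currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "span"):
            self._open = [tag, dict(attrs).get("href"), ""]
            self.nodes.append(self._open)

    def handle_endtag(self, tag):
        self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open[2] += data


def preceding_links(nodes, index, nearest_only):
    """hrefs of <a> siblings before nodes[index], nearest first (reverse-axis order)."""
    hrefs = [href for tag, href, _ in reversed(nodes[:index]) if tag == "a" and href]
    return hrefs[:1] if nearest_only else hrefs


SAMPLE = """
<a href="link1">link</a><span class="class_name">00A</span> ...
<a href="link2">link</a><span class="class_name">00B</span> ...
<a href="link3">link</a><span class="class_name">01B</span> ...
<a href="link4">link</a><span class="class_name">01A</span>
"""

parser = SiblingCollector()
parser.feed(SAMPLE)
# Position of the span labelled 01A among the collected siblings.
span_01a = next(i for i, n in enumerate(parser.nodes) if n[0] == "span" and n[2] == "01A")
all_links = preceding_links(parser.nodes, span_01a, nearest_only=False)
nearest = preceding_links(parser.nodes, span_01a, nearest_only=True)
```

Without the positional predicate, the 01A span picks up all four links. With it, only link4 (the closest preceding one) remains.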

    So to combine these ideas, here's my suggestion:

    //span
       [
          contains(@class, 'class_name') and
          contains(., 'A') and
          contains('0123456789', substring(., 1, 1)) and 
          translate(substring-before(., 'A'), '0123456789', '') = '' and
          substring-after(., 'A') = ''
       ]
       //preceding-sibling::a[@href][1]
    

    Applied to this input:

    <body>
      
    <a href="link1">link</a>
    <span class="class_name">00A</span>
    ...
    <a href="link2">link</a>
    <span class="class_name">00B</span>
    ...
    <a href="link3">link</a>
    <span class="class_name">01B</span>
    ...
    <a href="link4">link</a>
    <span class="class_name">01A</span>
    
    </body>
    

    ... it yields:

    <a href="link1">link</a>
    <a href="link4">link</a>
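To plug the combined expression back into the waiting code from the question, something along these lines should work. It is a hedged sketch, not tested against a real page. It assumes the span and a elements are siblings (so the single / form is used), waits for presence rather than visibility, and passes the literal "xpath" (the string value of By.XPATH) so the snippet itself needs no imports. LINKS_XPATH and matching_hrefs are hypothetical names.

```python
# The answer's XPath, assembled as one string (sibling form with a single "/").
LINKS_XPATH = (
    "//span["
    "contains(@class, 'class_name') and "
    "contains(., 'A') and "
    "contains('0123456789', substring(., 1, 1)) and "
    "translate(substring-before(., 'A'), '0123456789', '') = '' and "
    "substring-after(., 'A') = ''"
    "]/preceding-sibling::a[@href][1]"
)


def matching_hrefs(wait):
    """Wait until at least one matching link exists, then return all their hrefs.

    `wait` is assumed to be a selenium WebDriverWait, e.g. WebDriverWait(driver, 30).
    """
    # until() keeps calling the function with the driver until the result is truthy;
    # an empty list is falsy, so this waits for at least one match.
    links = wait.until(lambda d: d.find_elements("xpath", LINKS_XPATH))
    return [link.get_attribute("href") for link in links]
```

Usage would then look like links = matching_hrefs(WebDriverWait(driver, 30)), with WebDriverWait imported from selenium.webdriver.support.ui as in the question's code.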