
XPath: get an href only if it contains a specific word


Set up

I'm extracting hrefs from a page using the following XPath,

'/html/body/div/div[2]/div[2]/div/div/p[1]/a/@href'

which gives me a list of hrefs looking like,

['#',
 'showv2.php?p=Glasgow City&t=Anderston',
 'showv2.php?p=Glasgow City&t=Anniesland',
 'showv2.php?p=Glasgow City&t=Ashfield',
 '#',
 'showv2.php?p=Glasgow City&t=Baillieston',
           ⋮
'showv2.php?p=Glasgow City&t=Yoker']


Problem

I'm not interested in the '#' hrefs. All the hrefs I am interested in contain Glasgow. How do I select only the hrefs containing Glasgow?

I've seen answers that use regex on attributes such as 'id', but not on href, and those answers do not seem to work with href.

I've also seen answers that match the beginning or end of an href, but I'd like to match hrefs that contain a given word anywhere in the value.


Solution

  • Use a contains(@href, 'Glasgow') predicate (a "restriction") on the a elements:

    '/html/body/div/div[2]/div[2]/div/div/p[1]/a[contains(@href, "Glasgow")]/@href'

    This matches only those <a> elements under the specified path whose href attribute value contains Glasgow, so the '#' links are skipped. Note that contains() is case-sensitive.
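
The contains() predicate needs a full XPath 1.0 engine; the question doesn't say which library is in use. If you only have Python's built-in xml.etree.ElementTree, its XPath subset has no contains(), so the same restriction can be applied in Python after selecting the a elements. A minimal sketch, assuming a small well-formed stand-in for the page (the SAMPLE markup and the hrefs_containing helper are made up for illustration, not taken from the real site):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; the real markup
# is not shown in the question.
SAMPLE = """
<html><body><div>
  <p>
    <a href="#">A</a>
    <a href="showv2.php?p=Glasgow City&amp;t=Anderston">Anderston</a>
    <a href="showv2.php?p=Glasgow City&amp;t=Anniesland">Anniesland</a>
    <a href="#">B</a>
    <a href="showv2.php?p=Glasgow City&amp;t=Baillieston">Baillieston</a>
  </p>
</div></body></html>
"""


def hrefs_containing(markup, word):
    """Return the href values of <a> elements whose href contains `word`."""
    root = ET.fromstring(markup.strip())
    # ElementTree's XPath subset has no contains(), so select every <a>
    # that has an href attribute and do the substring test in Python.
    return [
        a.get("href")
        for a in root.iterfind(".//a[@href]")
        if word in a.get("href")
    ]


glasgow_hrefs = hrefs_containing(SAMPLE, "Glasgow")
print(glasgow_hrefs)
```

Like contains(), the `in` test is case-sensitive, so "glasgow" would not match here. With a full XPath engine, the one-line predicate from the answer does the same filtering inside the query itself.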