Get the Element Name From Attribute Value Using Xpath


I am trying to get the element/tag name of each node where I have a particular attribute value.

I have this XML:

<a node='1'>This</a>
<b node='2'>Is</b>
<c node='23'>A</c>
<d selector='g'>Loud</d>
<e node='4'>Dog</e>

I have a list, called nodes, of the node attribute values I want to collect.

I select the text from these nodes with:

for node in nodes:
   get_text = response.xpath(f'//*[@node="{node}"]//text()').extract()

I also want the element names for these nodes. However, when I use this line within the same for-loop:

get_name = response.xpath(f'//*[@node="{node}"]/name()').get()

I get this error:

ValueError: XPath error: Invalid expression

I have tried many variations, but am unable to get the element/tag names of each node.


Solution

  • The best way I know to get element tag names is Scrapy's built-in regex method, .re.

    The pattern I typically use is r'<(\w+)\s'.

    Here is an example:

    scrapy shell

    In [1]: markup = """<html><a node='1'>This</a>
       ...: <b node='2'>Is</b>
       ...: <c node='23'>A</c>
       ...: <d selector='g'>Loud</d>
       ...: <e node='4'>Dog</e></html>"""
    
    In [2]: sel = scrapy.Selector(text=markup)
    
    In [3]: sel.xpath('//*[@node]').re(r'<(\w+)\s')
    Out[3]: ['a', 'b', 'c', 'e']
    
    • In the above example, I take the markup you provided and wrap it in a parent tag.
    • I then use it to create a Scrapy Selector object.
    • Then I run an XPath query to get all elements that have the node attribute.
    • Then I use the .re method to apply the regex pattern, which captures each element's tag name.
    • The output is a list of the tag names of all elements that have the node attribute.