parsing, xpath

How to use XPath/CSS to get the value of an attribute whose name begins with a colon?


Given this HTML tag:

<nav-categories id="MainMenu" :json-data="{some data}">text</nav-categories>

I need to get the contents of the ":json-data" attribute. The standard methods (response.css('::attr(":json-data")') or response.css('::attr("\:json-data")')) do not work. I use Python + Scrapy (response.selector).


Solution

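  • Scrapy depends on lxml, so this answer uses lxml directly instead of Scrapy.
    XPath does not accept an attribute name that begins with a colon in a name test, because the colon is reserved as the namespace prefix separator. It can, however, compare the element or attribute name as a string.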

    >>> tree.xpath('//nav-categories/@:json-data')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "src/lxml/etree.pyx", line 2314, in lxml.etree._ElementTree.xpath
      File "src/lxml/xpath.pxi", line 357, in lxml.etree.XPathDocumentEvaluator.__call__
      File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
    lxml.etree.XPathEvalError: Invalid expression
    

    Using the name() XPath function as a workaround:

    >>> from lxml import html
    >>> tree = html.parse(r'/home/lmc/tmp/test.html')      
    >>> tree.xpath('//nav-categories/@*[name()=":json-data"]')
    ['{some data}']
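
The workaround above can also be tried without a file on disk. The following is a minimal sketch, assuming lxml is installed; the `SAMPLE` string wraps the tag from the question in a bare document, and the helper name `colon_attr` is hypothetical, introduced only for illustration. It passes the attribute name as an XPath variable so the expression itself stays free of awkward quoting.

```python
from lxml import html

# The tag from the question, wrapped in a minimal document (assumption:
# the real page has more markup around it).
SAMPLE = (
    '<html><body>'
    '<nav-categories id="MainMenu" :json-data="{some data}">text</nav-categories>'
    '</body></html>'
)


def colon_attr(doc, attr_name):
    """Return the values of attributes on <nav-categories> whose name equals attr_name.

    Hypothetical helper: it selects every attribute with @* and filters by
    name(), because a name test such as @:json-data is not valid XPath.
    """
    # $attr is an XPath variable; lxml binds it from the keyword argument.
    return doc.xpath('//nav-categories/@*[name()=$attr]', attr=attr_name)


doc = html.fromstring(SAMPLE)
values = colon_attr(doc, ':json-data')
print(values)
```

Since Scrapy selectors evaluate XPath through lxml, the same expression should also work as response.xpath('//nav-categories/@*[name()=":json-data"]').get(), although that was not tested in the answer.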