Tags: python, html, xpath, scrapy, scraper

How do you extract an embedded attribute value from a previous attribute value in an XPath query?


I'm trying to select the link from the onclick attribute in the following piece of HTML:

<span onclick="Javascript:document.quickFindForm.action='/blah_blah'" 
 class="specialLinkType"><img src="blah"></span>

but I can't get any further than the following XPath:

//span[@class="specialLinkType"]/@onclick

which only returns

Javascript:document.quickFindForm.action

Any ideas on how to pick out the link assigned to quickFindForm.action with an XPath?


Solution

  • I tried the XPath in a Java application and it returned the full attribute value, including the link:

        import java.io.IOException;
        import java.io.StringReader;
    
        import javax.xml.parsers.DocumentBuilder;
        import javax.xml.parsers.DocumentBuilderFactory;
        import javax.xml.parsers.ParserConfigurationException;
        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathExpression;
        import javax.xml.xpath.XPathFactory;
    
        import org.w3c.dom.Document;
        import org.xml.sax.InputSource;
        import org.xml.sax.SAXException;
    
        public class Teste {
    
            public static void main(String[] args) throws Exception {
                Document doc = stringToDom("<span onclick=\"Javascript:document.quickFindForm.action='/blah_blah'\" class=\"specialLinkType\"><img src=\"blah\"/></span>");
                XPath newXPath = XPathFactory.newInstance().newXPath();
                XPathExpression xpathExpr = newXPath.compile("//span[@class=\"specialLinkType\"]/@onclick");
                String result = xpathExpr.evaluate(doc);
                System.out.println(result);
    
            }
    
            public static Document stringToDom(String xmlSource) throws SAXException, ParserConfigurationException, IOException {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                DocumentBuilder builder = factory.newDocumentBuilder();
                return builder.parse(new InputSource(new StringReader(xmlSource)));
            }
        }
    

    Result:

    Javascript:document.quickFindForm.action='/blah_blah'
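
If the goal is to get only the link rather than the whole onclick value, one option is to do the slicing inside the XPath itself with the XPath 1.0 string functions substring-after and substring-before. The following is a minimal sketch, assuming the link is the first single-quoted string in the attribute and that the input parses as well-formed XML. The class name OnclickLinkExtractor and the helper extractLink are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class OnclickLinkExtractor {

    // XPath 1.0: keep the text after the first single quote,
    // then cut it off at the next single quote.
    // Assumption: the link is the first single-quoted string in @onclick.
    static final String LINK_XPATH =
        "substring-before(substring-after("
        + "//span[@class='specialLinkType']/@onclick, \"'\"), \"'\")";

    static String extractLink(String xmlSource) throws Exception {
        // Parse the markup as XML (the input must be well-formed).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmlSource)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the expression.
        return xpath.evaluate(LINK_XPATH, doc);
    }

    public static void main(String[] args) throws Exception {
        String html = "<span onclick=\"Javascript:document.quickFindForm.action='/blah_blah'\" "
                + "class=\"specialLinkType\"><img src=\"blah\"/></span>";
        System.out.println(extractLink(html)); // prints /blah_blah
    }
}
```

The expression uses only XPath 1.0 functions, so it should carry over to other XPath 1.0 engines, such as the lxml-based selectors in Scrapy. If the onclick value can contain other quoted strings before the link, the slicing would need a more specific marker than the first single quote.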