xpath, yql, yahoo-api

Is it possible to filter the descendant elements returned from an XPath query?


At the moment, I'm trying to scrape forms from some sites using the following query:

select * from html 
where url="http://somedomain.com" 
and xpath="//form[@action]"

This returns a result like so:

{
    form: {
        action: "/some/submit",
        id: "someId",
        div: {
            input: [
               ... some input elements here
            ]
        },
        fieldset: {
            div: {
                input: [
                    ... some more input elements here
                ]
            }
        }
    }
}

On some sites this could go many levels deep, so I'm not sure how to begin trying to filter out the unwanted elements in the result. If I could filter them out here, then it would make my back-end code much simpler. Basically, I'd just like the form and any label, input, select (and option) and textarea descendants.

Here's an XPath query I tried, but I realised that it does not preserve the element hierarchy, which could be a problem if there are multiple forms on the page:

//form[@action]/descendant-or-self::*[self::form or self::input or self::select or self::textarea or self::label]

On the other hand, the elements returned by this query were no longer nested under divs and other wrapper elements beneath the form.


Solution

  • I don't think this is possible with a plain query like the one you tried.

    However, it would not be too much work to create a new data table containing some JavaScript that does the filtering you're looking for.

    Data table

    A quick little <execute> block might look something like the following.

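    // url, xpath and filter are assumed to be inputs declared elsewhere in the data table
    var elements = y.query("select * from html where url=@u and xpath=@x", {u: url, x: xpath}).results.elements();
    var results = <url url={url}></url>;
    for each (var element in elements) {
        // Copy the matched element, then strip its children so only its attributes remain
        var result = element.copy();
        result.setChildren("");
        result.normalize();
        // Re-attach only the descendants matched by the filter XPath
        for each (var descendant in y.xpath(element, filter)) {
            result.node += descendant;
        }
        results.node += result;
    }
    response.object = results;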
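
If the filtering has to happen in back-end code instead, the same idea (keep each form's own attributes, drop the wrappers, re-attach only the wanted descendants) can be sketched in plain JavaScript over the JSON shape shown in the question. This is only a rough sketch: `WANTED`, `collectFields` and `filterForm` are hypothetical names, the `name` attributes in the sample are made up for illustration, and it assumes attributes arrive as primitive values while child elements arrive as objects or arrays.

```javascript
// Tags to keep, taken from the question; everything else is treated as a wrapper.
const WANTED = new Set(["label", "input", "select", "option", "textarea"]);

// Hypothetical helper: walk a JSON node and collect wanted descendants by tag name.
// Assumes child elements are object- or array-valued keys and attributes are primitives.
function collectFields(node, out = {}) {
  if (node === null || typeof node !== "object") return out; // text content, nothing to walk
  for (const [key, value] of Object.entries(node)) {
    if (value === null || typeof value !== "object") continue; // attribute, skip
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      if (WANTED.has(key)) {
        (out[key] = out[key] || []).push(child); // keep the element as-is
      } else {
        collectFields(child, out); // unwrap div, fieldset, etc.
      }
    }
  }
  return out;
}

// Hypothetical helper: keep the form's attributes, replace its children
// with the flattened list of wanted descendants.
function filterForm(form) {
  const result = {};
  for (const [key, value] of Object.entries(form)) {
    if (value === null || typeof value !== "object") result[key] = value;
  }
  return Object.assign(result, collectFields(form));
}

// Sample mirroring the structure in the question (input attributes are invented).
const sample = {
  form: {
    action: "/some/submit",
    id: "someId",
    div: { input: [{ name: "a" }, { name: "b" }] },
    fieldset: { div: { input: [{ name: "c" }] } },
  },
};

const filtered = filterForm(sample.form);
console.log(JSON.stringify(filtered));
// {"action":"/some/submit","id":"someId","input":[{"name":"a"},{"name":"b"},{"name":"c"}]}
```

Because `filterForm` works on one form at a time, a page with several forms can be handled by mapping it over each form, so fields stay grouped under the form they came from. Wanted elements are kept whole, so anything nested inside a kept `label` or `select` is left untouched.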
    

    » See the full example data table.

    Example query

    use "store://VNZVLxovxTLeqYRH6yQQtc" as example;
    select * from example where url="http://www.yahoo.com"
    

    » See this query in the YQL console

    Example results

    Query results XML

    Hopefully the above is a step in the right direction, and doesn't look too daunting.

    Links