At the moment, I'm trying to scrape forms from some sites using the following query:
select * from html
where url="http://somedomain.com"
and xpath="//form[@action]"
This returns a result like so:
{
  form: {
    action: "/some/submit",
    id: "someId",
    div: {
      input: [
        ... some input elements here
      ]
    },
    fieldset: {
      div: {
        input: [
          ... some more input elements here
        ]
      }
    }
  }
}
On some sites this could go many levels deep, so I'm not sure how to begin trying to filter out the unwanted elements in the result. If I could filter them out here, then it would make my back-end code much simpler. Basically, I'd just like the form and any label, input, select (and option) and textarea descendants.
Here's an XPath query I tried, but I realised that the element hierarchy would not be maintained and this might cause a problem if there are multiple forms on the page:
//form[@action]/descendant-or-self::*[self::form or self::input or self::select or self::textarea or self::label]
It did get part of the way there, though: the elements returned by this query were no longer nested under divs and other wrapper elements beneath the form.
I don't think this is possible with a plain query like the one you tried.
However, it would not be too much work to create a new data table containing some JavaScript that does the filtering you're looking for.
Data table
A quick, little <execute> block might look something like the following.
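// url, xpath and filter are presumably inputs declared elsewhere in the data table
var elements = y.query("select * from html where url=@u and xpath=@x", {u: url, x: xpath}).results.elements();
var results = <url url={url}></url>;
for each (var element in elements) {
    // Copy the matched form, then empty the copy so only its attributes remain
    var result = element.copy();
    result.setChildren("");
    result.normalize();
    // Append only the descendants matched by the filter XPath, directly under the form
    for each (var descendant in y.xpath(element, filter)) {
        result.node += descendant;
    }
    results.node += result;
}
response.object = results;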
» See the full example data table.
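If it helps to see the same idea outside YQL, here is a rough sketch in plain JavaScript. It is not the data table's code: it works on a hypothetical tree of `{tag, attrs, children}` objects rather than on real HTML, and the names `WANTED`, `collectWanted`, `flattenForm` and `flattenForms` are made up for illustration. The approach is the same, though: keep each form's attributes, drop its children, then re-attach only the wanted descendants directly under the form.

```javascript
// Tags to keep, taken from the question. "option" is not listed separately
// because a kept select is copied whole, options included (an assumption of this sketch).
const WANTED = new Set(["label", "input", "select", "textarea"]);

// Collect, in document order, the wanted descendants of `node`.
// A matched node is kept as-is and not searched any further.
function collectWanted(node, wanted) {
  const found = [];
  for (const child of node.children || []) {
    if (wanted.has(child.tag)) {
      found.push(child);
    } else {
      found.push(...collectWanted(child, wanted));
    }
  }
  return found;
}

// Copy the form's tag and attributes, with only the wanted descendants as direct children.
function flattenForm(form, wanted = WANTED) {
  return {
    tag: form.tag,
    attrs: { ...form.attrs },
    children: collectWanted(form, wanted),
  };
}

// Find every form that has an action attribute (like //form[@action]) and flatten it.
function flattenForms(root, wanted = WANTED) {
  const forms = [];
  (function walk(node) {
    if (node.tag === "form" && node.attrs && "action" in node.attrs) {
      forms.push(flattenForm(node, wanted));
      return; // do not look for forms inside a form
    }
    for (const child of node.children || []) walk(child);
  })(root);
  return forms;
}

// Hypothetical input shaped like the result in the question.
const page = {
  tag: "body", attrs: {}, children: [
    { tag: "form", attrs: { action: "/some/submit", id: "someId" }, children: [
      { tag: "div", attrs: {}, children: [
        { tag: "input", attrs: { name: "a" }, children: [] },
      ] },
      { tag: "fieldset", attrs: {}, children: [
        { tag: "div", attrs: {}, children: [
          { tag: "input", attrs: { name: "b" }, children: [] },
        ] },
      ] },
    ] },
  ],
};

const forms = flattenForms(page);
console.log(JSON.stringify(forms, null, 2));
```

With that input, both inputs end up as direct children of the one form, and the div and fieldset wrappers are gone. Because each form gets its own entry, multiple forms on a page stay separate, which was the concern with the single flat XPath.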
Example query
use "store://VNZVLxovxTLeqYRH6yQQtc" as example;
select * from example where url="http://www.yahoo.com"
» See this query in the YQL console
Example results
Hopefully the above is a step in the right direction, and doesn't look too daunting.