To scrape Hacker News, I use:
xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]' https://news.ycombinator.com/newest
But the output is not in the expected order: the URL comes after the text, which makes it very difficult to parse.
Am I missing something to get the right order?
I have:
There Is No Such Thing as a Microservice (youtube.com)
https://www.youtube.com/watch?v=FXCLLsCGY0s
I expect:
https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice (youtube.com)
Or, even better:
https://www.youtube.com/watch?v=FXCLLsCGY0s There Is No Such Thing as a Microservice (youtube.com)
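Please see "Using / on sequences rather than on sets" for why this is happening: node sequences produced by / and | are returned in document order, so the span text is output before the href of the a inside it. Use the XPath 3 mapping operator ! instead, which returns the results in the order you write them.
In this case: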
$ xidel -s "https://news.ycombinator.com/newest" -e '
//span[@class="titleline"]/a ! (@href,.)
'
(Also, please specify the input first.)
For simple string concatenation this isn't necessary, because each step then returns a string rather than nodes, and strings are not reordered:
-e '//span[@class="titleline"]/a/join((@href,.))'
-e '//span[@class="titleline"]/a/concat(@href," ",.)'
-e '//span[@class="titleline"]/a/x"{@href} {.}"'
(Bonus) Output to JSON:
$ xidel -s "https://news.ycombinator.com/newest" -e '
array{
//span[@class="titleline"]/a/{
"title":.,
"url":@href
}
}
'
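If it helps to see what the ! mapping does outside xidel, here is a minimal Python sketch of the same idea: visit each a in turn and emit its href first, then its text. It only uses the standard library. The SAMPLE markup and the function name href_then_title are made up for illustration; the real Hacker News page is not well-formed XML, so this would not parse it directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for two Hacker News rows (not real data).
SAMPLE = """
<table>
  <tr><td><span class="titleline"><a href="https://example.com/a">First story</a> (example.com)</span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.org/b">Second story</a> (example.org)</span></td></tr>
</table>
"""


def href_then_title(markup):
    """Return (href, title) pairs, one per link, in the order the links appear."""
    root = ET.fromstring(markup.strip())
    pairs = []
    # Same selection as //span[@class="titleline"]/a in the xidel command.
    for a in root.findall(".//span[@class='titleline']/a"):
        # Per-item mapping: href first, then the link text, like a ! (@href, .)
        pairs.append((a.get("href"), "".join(a.itertext())))
    return pairs


if __name__ == "__main__":
    for href, title in href_then_title(SAMPLE):
        print(href, title)
```

The order inside each pair is decided by the code that maps over the links, not by where the nodes sit in the document, which is the point of using ! over / in the xidel version.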