I'm working on a scraper in R for an Atom feed and having issues grabbing the link for each article. Here's my code:
library(RCurl)  # provides getURL
library(XML)    # provides htmlParse, xpathSApply
url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
pageSource <- getURL(url, encoding = "UTF-8")
parsed <- htmlParse(pageSource)
titles <- xpathSApply(parsed, '//entry/title', xmlValue)
authors <- xpathSApply(parsed, '//entry/author', xmlValue)
links <- xpathSApply(parsed, '//entry/link/@href')
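# Assumption: pubDates was not defined in the posted snippet; each entry's <updated> child is used here
pubDates <- xpathSApply(parsed, '//entry/updated', xmlValue)
dataFrame <- data.frame(pubDates, titles, authors)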
My problem is I'm picking up 18 titles, 18 authors, and 20 links. I think I'm picking up the first two links from the feed page, but I'm not sure how to stop picking them up.
Thanks for your help!
You can work from "//entry" rather than from the individual nodes. Some entry nodes have more than one link, which is why you get more links than titles. For example:
out <- xpathApply(parsed, "//entry", function(x){
  children <- xmlChildren(x)
  title <- xmlValue(children$title)
  author <- xmlValue(children$author)
  # keep every <link> child of this entry; there may be more than one
  links <- children[names(children) %in% "link"]
  links <- sapply(links, function(y){xmlGetAttr(y, "href")})
  # title and author are recycled, giving one row per link
  data.frame(title, author, links, stringsAsFactors = FALSE)
})
> out[[1]]
title author
1 Soap opera star in serious injury crash in Ohio CNHI News Service
2 Soap opera star in serious injury crash in Ohio CNHI News Service
links
1 http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html
2 http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a3a8b12e9f/54354a7b66bd9.image.jpg?resize=300%2C450
> out[[2]]
title author
link Q5: Voter registration deadline nears By Michelle Charles/Stillwater News Press
links
link http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html
You can then rbind your individual entries together:
res <- do.call(rbind.data.frame, out)
> str(res)
'data.frame': 147 obs. of 3 variables:
$ title : chr "Soap opera star in serious injury crash in Ohio" "Soap opera star in serious injury crash in Ohio" "Q5: Voter registration deadline nears" "Oklahoma State assault under investigation" ...
$ author: chr "CNHI News Service" "CNHI News Service" "By Michelle Charles/Stillwater News Press" "By Megan Sando/Stillwater News Press" ...
$ links : chr "http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html" "http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a"| __truncated__ "http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html" "http://www.stwnewspress.com/news/local_news/article_7023a110-4ea4-11e4-82dd-f735d5c5ed44.html" ...
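If you want one row per article instead of one row per link, here is a minimal sketch that counts the links in each entry and keeps only the first one. It runs on a small made-up feed (the example.com URLs and names are hypothetical) that omits the Atom namespace declaration so plain XPath names like //entry match. It also assumes the article link comes first in each entry, as it does in the out[[1]] output above; check that against the real feed.

```r
library(XML)

# Hypothetical two-entry feed, made up for illustration only
feedText <- '<feed>
  <link href="http://example.com/feed"/>
  <entry>
    <title>First</title>
    <author><name>A. Writer</name></author>
    <link href="http://example.com/a.html"/>
    <link href="http://example.com/a.jpg"/>
  </entry>
  <entry>
    <title>Second</title>
    <author><name>B. Writer</name></author>
    <link href="http://example.com/b.html"/>
  </entry>
</feed>'
doc <- xmlParse(feedText, asText = TRUE)

# Count the <link> children of each entry to see where the extra links come from
entries <- getNodeSet(doc, "//entry")
linksPerEntry <- sapply(entries, function(x) sum(names(xmlChildren(x)) == "link"))

# Keep only the first <link> of each entry (assumed to be the article link)
titles <- xpathSApply(doc, "//entry/title", xmlValue)
firstLinks <- xpathSApply(doc, "//entry/link[1]", xmlGetAttr, "href")
res1 <- data.frame(title = titles, link = firstLinks, stringsAsFactors = FALSE)
```

Because the path starts at //entry, the feed-level link outside any entry is never selected.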
To understand how the function works, look at the first entry, calling it x:
url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
pageSource <- getURL(url, encoding = "UTF-8")
parsed <- htmlParse(pageSource)
x <- parsed["//entry"][[1]]
children <- xmlChildren(x)
> names(children)
[1] "title" "author" "link" "id" "content" "category"
[7] "updated"
> children$title
<title>BYRON YORK: Jindal a GOP darkhorse in 2016 race</title>
> xmlValue(children$title)
[1] "BYRON YORK: Jindal a GOP darkhorse in 2016 race"
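The link step of the function can be inspected the same way. This is a sketch on a made-up single entry (hypothetical example.com URLs) rather than the live feed, so it runs without a network connection:

```r
library(XML)

# Hypothetical entry with two links, made up for illustration only
entryText <- '<feed><entry>
  <title>First</title>
  <link href="http://example.com/a.html"/>
  <link href="http://example.com/a.jpg"/>
</entry></feed>'
doc <- xmlParse(entryText, asText = TRUE)

x <- getNodeSet(doc, "//entry")[[1]]
children <- xmlChildren(x)

# Select the children named "link", then read the href attribute of each
linkNodes <- children[names(children) == "link"]
hrefs <- unname(sapply(linkNodes, xmlGetAttr, "href"))
```

Here hrefs holds one URL per link child, which is why the data.frame built inside the function gets one row per link.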