R Scrape Atom Feed to a Data Frame

I'm working on a scraper in R for an Atom feed and having issues grabbing the link for each article. Here's my code:

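library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply()

url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
pageSource <- getURL(url, encoding = "UTF-8")
parsed <- htmlParse(pageSource)
# pubDates was not defined in the snippet as posted; assumed here to come from each entry's <updated> element
pubDates <- xpathSApply(parsed, '//entry/updated', xmlValue)
titles <- xpathSApply(parsed, '//entry/title', xmlValue)
authors <- xpathSApply(parsed, '//entry/author', xmlValue)
links <- xpathSApply(parsed, '//entry/link/@href')
dataFrame <- data.frame(pubDates, titles, authors)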

My problem is I'm picking up 18 titles, 18 authors, and 20 links. I think I'm picking up the first two links from the feed page, but I'm not sure how to stop picking them up.

Thanks for your help!

Solution

You can work from "//entry" rather than from the individual nodes. Some entry nodes have multiple links, which is why you get more links than titles. For example:

out <- xpathApply(parsed, "//entry", function(x){
  children <- xmlChildren(x)
  title <- xmlValue(children$title)
  author <- xmlValue(children$author)
  links <- children[names(children)%in%"link"]
  links <- sapply(links, function(y){xmlGetAttr(y, "href")})
  data.frame(title, author, links, stringsAsFactors = FALSE)
})

> out[[1]]
                                            title            author
1 Soap opera star in serious injury crash in Ohio CNHI News Service
2 Soap opera star in serious injury crash in Ohio CNHI News Service
                                                                                                                                                                              links
1                                                                                        http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html
2 http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a3a8b12e9f/54354a7b66bd9.image.jpg?resize=300%2C450
> out[[2]]
                                     title                                    author
link Q5: Voter registration deadline nears By Michelle Charles/Stillwater News Press
                                                                                             links
link http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html
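
The row counts of these per-entry data frames account for the mismatch in the question: each entry contributes one title and one author, but possibly several links. A minimal sketch, using a hand-built stand-in for out (the titles, authors, and URLs below are made up for illustration and are not from the feed):

```r
# Hypothetical stand-in for the list returned by xpathApply() above
out <- list(
  data.frame(title = "First article", author = "Author A",
             links = c("http://example.com/1.html", "http://example.com/1.jpg"),
             stringsAsFactors = FALSE),
  data.frame(title = "Second article", author = "Author B",
             links = "http://example.com/2.html",
             stringsAsFactors = FALSE)
)

# One row per link, so nrow() gives the number of links in each entry
n_links <- vapply(out, nrow, integer(1))

length(out)    # number of entries, i.e. number of titles and authors
sum(n_links)   # total number of links across all entries
```

Whenever sum(n_links) is larger than length(out), at least one entry carries more than one link.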

You can then rbind your individual entries together:

res <- do.call(rbind.data.frame, out)
> str(res)
'data.frame':   147 obs. of  3 variables:
 $ title : chr  "Soap opera star in serious injury crash in Ohio" "Soap opera star in serious injury crash in Ohio" "Q5: Voter registration deadline nears" "Oklahoma State assault under investigation" ...
 $ author: chr  "CNHI News Service" "CNHI News Service" "By Michelle Charles/Stillwater News Press" "By Megan Sando/Stillwater News Press" ...
 $ links : chr  "http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html" "http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a"| __truncated__ "http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html" "http://www.stwnewspress.com/news/local_news/article_7023a110-4ea4-11e4-82dd-f735d5c5ed44.html" ...
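
If you would rather have exactly one row per article, one possible variation is to keep only the first link of each entry with the XPath predicate link[1]. This is a sketch under two assumptions: that the article URL is always the first link in an entry (as in the output above), and that every entry has at least one link. The feed below is a small made-up example parsed with xmlParse; it has no XML namespace, unlike a real Atom feed, which the code above reads with htmlParse.

```r
library(XML)

# Made-up, namespace-free feed used only for illustration
feed <- paste0(
  "<feed>",
  "<link href='http://example.com/feed'/>",
  "<entry><title>First</title><author><name>Author A</name></author>",
  "<link href='http://example.com/1.html'/>",
  "<link href='http://example.com/1.jpg'/></entry>",
  "<entry><title>Second</title><author><name>Author B</name></author>",
  "<link href='http://example.com/2.html'/></entry>",
  "</feed>"
)
doc <- xmlParse(feed, asText = TRUE)

titles  <- xpathSApply(doc, "//entry/title", xmlValue)
authors <- xpathSApply(doc, "//entry/author", xmlValue)
# link[1] selects only the first <link> child of each <entry>
links   <- as.character(xpathSApply(doc, "//entry/link[1]/@href"))

articles <- data.frame(titles, authors, links, stringsAsFactors = FALSE)
```

All three vectors now have one element per entry, so they fit into a single data frame. The extra image links are dropped rather than kept as additional rows.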

To understand how the function works, look at the first entry, calling it x:

url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
pageSource <- getURL(url, encoding = "UTF-8")
parsed <- htmlParse(pageSource)
x <- parsed["//entry"][[1]]
children <- xmlChildren(x)

> names(children)
[1] "title"    "author"   "link"     "id"       "content"  "category"
[7] "updated"

> children$title
<title>BYRON YORK: Jindal a GOP darkhorse in 2016 race</title> 

> xmlValue(children$title)
[1] "BYRON YORK: Jindal a GOP darkhorse in 2016 race"
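
The link step follows the same pattern: subset the children named "link", then read each node's href attribute. A self-contained sketch on a made-up entry (the title, author, and URLs are invented; the toy XML is parsed with xmlParse because it carries no namespace):

```r
library(XML)

# Made-up entry with two links, mimicking the structure shown above
feed <- paste0(
  "<feed><entry>",
  "<title>First article</title>",
  "<author><name>Author A</name></author>",
  "<link href='http://example.com/a.html'/>",
  "<link href='http://example.com/a.jpg'/>",
  "</entry></feed>"
)
doc <- xmlParse(feed, asText = TRUE)

x <- getNodeSet(doc, "//entry")[[1]]   # first entry node
children <- xmlChildren(x)

# Keep only the children whose element name is "link"
link_nodes <- children[names(children) %in% "link"]

# Extract the href attribute from each remaining node
hrefs <- sapply(link_nodes, function(y) xmlGetAttr(y, "href"))
unname(hrefs)
```

Because hrefs can hold more than one value while title and author hold one each, data.frame() recycles the title and author across the rows, which gives one row per link.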