R Scrape Atom Feed to a Data Frame


I'm working on a scraper in R for an Atom feed and having issues grabbing the link for each article. Here's my code:

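library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply(), xmlValue()

url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
pageSource <- getURL(url, encoding = "UTF-8")
parsed <- htmlParse(pageSource)
# pubDates was not defined in the posted code; assumed here to come from <updated>
pubDates <- xpathSApply(parsed, '//entry/updated', xmlValue)
titles <- xpathSApply(parsed, '//entry/title', xmlValue)
authors <- xpathSApply(parsed, '//entry/author', xmlValue)
links <- xpathSApply(parsed, '//entry/link/@href')
dataFrame <- data.frame(pubDates, titles, authors)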

My problem is that I'm picking up 18 titles, 18 authors, and 20 links. I think the two extra links are the first two links from the feed page, but I'm not sure how to exclude them.

Thanks for your help!


Solution

  • You can work from "//entry" rather than from the individual nodes. Some entry nodes have multiple links (an article URL plus an image URL, say), which is why you get more links than titles. For example:

    out <- xpathApply(parsed, "//entry", function(x){
      children <- xmlChildren(x)
      title <- xmlValue(children$title)
      author <- xmlValue(children$author)
      # keep every <link> child of this entry; there may be more than one
      links <- children[names(children) %in% "link"]
      links <- sapply(links, function(y){xmlGetAttr(y, "href")})
      # title and author are recycled, giving one row per link
      data.frame(title, author, links, stringsAsFactors = FALSE)
    })
    
    > out[[1]]
                                                title            author
    1 Soap opera star in serious injury crash in Ohio CNHI News Service
    2 Soap opera star in serious injury crash in Ohio CNHI News Service
                                                                                                                                                                                  links
    1                                                                                        http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html
    2 http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a3a8b12e9f/54354a7b66bd9.image.jpg?resize=300%2C450
    > out[[2]]
                                         title                                    author
    link Q5: Voter registration deadline nears By Michelle Charles/Stillwater News Press
                                                                                                 links
    link http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html
    

    You can then rbind your individual entries together:

    res <- do.call(rbind.data.frame, out)
    > str(res)
    'data.frame':   147 obs. of  3 variables:
     $ title : chr  "Soap opera star in serious injury crash in Ohio" "Soap opera star in serious injury crash in Ohio" "Q5: Voter registration deadline nears" "Oklahoma State assault under investigation" ...
     $ author: chr  "CNHI News Service" "CNHI News Service" "By Michelle Charles/Stillwater News Press" "By Megan Sando/Stillwater News Press" ...
     $ links : chr  "http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html" "http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a"| __truncated__ "http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html" "http://www.stwnewspress.com/news/local_news/article_7023a110-4ea4-11e4-82dd-f735d5c5ed44.html" ...
    

    To understand how the function works, look at the first entry, calling it x:

    url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
    pageSource <- getURL(url, encoding = "UTF-8")
    parsed <- htmlParse(pageSource)
    x <- parsed["//entry"][[1]]
    children <- xmlChildren(x)
    
    > names(children)
    [1] "title"    "author"   "link"     "id"       "content"  "category"
    [7] "updated"
    
    > children$title
    <title>BYRON YORK: Jindal a GOP darkhorse in 2016 race</title> 
    
    > xmlValue(children$title)
    [1] "BYRON YORK: Jindal a GOP darkhorse in 2016 race"
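
If you want exactly one row per article rather than one row per link, the same per-entry idea can be adapted to keep only the first link of each entry. The following is a minimal sketch, not tested against the live feed: the inline `feed` string, the example.com URLs, and the reporter names are made up for illustration, the sample has no XML namespace (so it is parsed with xmlParse rather than htmlParse), and it assumes the article URL is the first link in each entry, as in the out[[1]] output above.

```r
library(XML)

# Hypothetical two-entry feed; the first entry carries two <link> nodes.
feed <- '<feed>
  <entry>
    <title>First story</title>
    <author><name>Reporter A</name></author>
    <link href="http://example.com/a.html"/>
    <link href="http://example.com/a.jpg"/>
  </entry>
  <entry>
    <title>Second story</title>
    <author><name>Reporter B</name></author>
    <link href="http://example.com/b.html"/>
  </entry>
</feed>'

parsed <- xmlParse(feed, asText = TRUE)

# Node-by-node extraction reproduces the mismatch: 2 titles but 3 links.
titles <- xpathSApply(parsed, "//entry/title", xmlValue)
all_links <- xpathSApply(parsed, "//entry/link/@href")

# Per-entry extraction: one row per entry, keeping only its first link.
rows <- xpathApply(parsed, "//entry", function(entry) {
  kids <- xmlChildren(entry)
  hrefs <- sapply(kids[names(kids) %in% "link"], xmlGetAttr, "href")
  data.frame(
    title = xmlValue(kids$title),
    author = xmlValue(kids$author),
    # NA when an entry has no <link> at all
    link = if (length(hrefs) > 0) unname(hrefs[1]) else NA_character_,
    stringsAsFactors = FALSE
  )
})
res <- do.call(rbind, rows)

# XPath-only alternative: link[1] selects the first <link> child of each entry.
# Unlike the loop above, entries without any link are silently dropped.
first_links <- as.character(xpathSApply(parsed, "//entry/link[1]/@href"))
```

The loop version is a bit longer, but it keeps title, author, and link aligned even when an entry has no link, which the XPath-only version cannot guarantee.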