xml r rcurl httr

Scrape URLs from a Google search using R and httr


I would like to get the result URLs from a Google web search. I fetch the page as follows:

library(httr)
library(XML)  # provides xpathApply() and xmlAttrs()
search.term="httr+package+daterange:%3A2456294-2456659"
url.name=paste0("https://www.google.com/search?q=",search.term)
url.get=GET(url.name)
url.content=content(url.get)

The attempt to extract the links from the result then fails:

links <- xpathApply(url.content, "//h3//a[@href]", function(x) xmlAttrs(x)[[1]])
Error in UseMethod("xpathApply") : 
no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"

What is the best way to get the links out of url.content?


Solution

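  • By default, content() parses the HTML response and returns an object of class XMLDocumentContent, which xpathApply() has no method for. Call content() with as="text" to get the raw HTML as a string instead, then parse it with htmlParse() from the XML package: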

    library(httr)
    library(XML)  # provides htmlParse() and xpathSApply()
    search.term="httr+package+daterange:%3A2456294-2456659"
    url.name=paste0("https://www.google.com/search?q=",search.term)
    url.get=GET(url.name)
    url.content=content(url.get, as="text")
    links <- xpathSApply(htmlParse(url.content), "//a/@href")
    head(links,3)
    # href 
    # "https://www.google.com/webhp?tab=ww" 
    # href 
    # "https://www.google.com/search?q=httr%2Bpackage%2Bdaterange::2456294-2456659&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi" 
    # href 
    # "https://maps.google.com/maps?q=httr%2Bpackage%2Bdaterange::2456294-2456659&um=1&ie=UTF-8&hl=en&sa=N&tab=wl" 
    

    Update:

    As Jake points out in a comment, this also works:

    url.get=GET(url.name)
    links <- xpathSApply(htmlParse(url.get), "//a/@href")
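
The "//a/@href" query above returns every link on the page, including navigation links. A minimal sketch of narrowing it to the result links the question asked about (anchors inside h3 headings) is shown below. It runs on a small inline HTML string rather than a live request, so the markup, the example URLs, and the "/url?q=" wrapper around each target are assumptions for illustration and may not match what Google returns today.

```r
library(XML)

# Hypothetical stand-in for a downloaded results page (not real Google output)
html <- '<html><body>
  <a href="https://www.google.com/webhp?tab=ww">Web</a>
  <h3><a href="/url?q=https://cran.r-project.org/package=httr&amp;sa=U">httr on CRAN</a></h3>
  <h3><a href="/url?q=https://github.com/r-lib/httr&amp;sa=U">httr on GitHub</a></h3>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Keep only anchors nested inside <h3> headings; unname() drops the "href" names
hrefs <- unname(xpathSApply(doc, "//h3//a/@href"))

# Assumed wrapper format: keep the text between "/url?q=" and the first "&"
targets <- sub("^/url\\?q=([^&]*).*$", "\\1", hrefs)
print(targets)
```

On a real page the same XPath can be applied to htmlParse(url.content) from the answer above. If the extracted targets are percent-encoded, utils::URLdecode() can be applied to each one.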