Tags: xml, r, scraper, httr

Problems scraping web page in R


I'm trying to scrape a specific section of a web page, using XPath to locate it. The section seems to be "hidden": other parts of the page are easily reachable, but this one returns a NULL value.

I've tried several packages, but I'm really not an expert on the subject, so I can't tell what's going on or whether there is a way to solve it.

This is what I've tried:

require("XML")
require("scrapeR")
require("httr")

url <- "http://www.claro.com.ar/portal/ar/pc/personas/movil/eq-new/?eq=537"
xp <- '//*[@id="dv_MainContainerEquiposResumen"]/div[1]/h1'

# Attempt 1: scrapeR
page <- scrape(url)
xpathApply(page[[1]], xp, xmlValue)
# NULL

# Attempt 2: httr
url.get <- GET(url)
xpathSApply(content(url.get), xp)
# NULL

# Attempt 3: RCurl + XML
webpage <- getURL(url)
doc <- htmlTreeParse(webpage, error = function(...){}, useInternalNodes = TRUE)
xpathSApply(doc, xp)
# NULL
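
To check whether the section is present in the server-rendered HTML at all, one simple diagnostic (an illustrative sketch, reusing the RCurl download from above) is to search the raw source for the container's id:

webpage <- getURL(url)
# FALSE here would mean the container id never appears in the raw HTML,
# i.e. the section is most likely injected client-side by JavaScript
grepl("dv_MainContainerEquiposResumen", webpage, fixed = TRUE)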

Solution

  • The section is most likely rendered by JavaScript after the page loads, which would explain why it never shows up in the HTML that the packages above fetch. You can scrape the page by driving a real browser with Selenium and the RSelenium package:

    url <- "http://www.claro.com.ar/portal/ar/pc/personas/movil/eq-new/?eq=537"
    xp <- '//*[@id="dv_MainContainerEquiposResumen"]/div[1]/h1'
    require(RSelenium)
    RSelenium::startServer()                 # start a local Selenium server
    remDr <- remoteDriver()
    remDr$open()                             # launch the browser session
    remDr$navigate(url)
    webElem <- remDr$findElement(using = "xpath", value = xp)
    webElem$getElementAttribute("outerHTML")[[1]]
    # [1] "<h1>Samsung Galaxy Core</h1>"
    webElem$getElementAttribute("innerHTML")[[1]]
    # [1] "Samsung Galaxy Core"
    remDr$close()
    remDr$closeServer()
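
    Note that startServer() has been removed from newer releases of
    RSelenium. If that call errors for you, a rough equivalent using
    rsDriver(), which starts a server and returns an already-open client,
    might look like the following sketch (assuming a local Firefox
    install):

    library(RSelenium)
    rD <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr <- rD$client                       # already opened by rsDriver()
    remDr$navigate(url)
    webElem <- remDr$findElement(using = "xpath", value = xp)
    webElem$getElementText()[[1]]            # should print "Samsung Galaxy Core"
    remDr$close()
    rD$server$stop()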