Tags: html, r, web-scraping, rcurl, rselenium

Download html through a password portal


I would like to download HTML webpages from www.geocaching.com in order to scrape some information. However, the pages I want are displayed in two ways depending on whether the user is logged in, and the information I want to scrape only appears in the logged-in version.

In the past I have used download.file() and mapply() to download HTML files from a list of URLs (geocache_link_list) and name them using another list (geocache_name_list), like this:

mapply(function(x,y) download.file(x,y), geocache_link_list, geocache_name_list)

but this downloads the logged-out version of each page.

I also tried RCurl, but it too downloaded the logged-out page, so I never incorporated it into an mapply() call:

library(RCurl)
baseurl <- geocache_link_list[1]
un <- readline("Type the username:")
pw <- readline("Type the password:")
upw <- paste(un, pw, sep = ":")
page <- getURL(baseurl, userpwd = upw)  # HTTP basic auth; still returns the logged-out page

Is there a way to drive a browser from within R, using something like RSelenium or RCurl, to enter the login details and then navigate to the desired pages and download them?
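For reference, the browser-driven route mentioned above could look roughly like this sketch. It assumes a Selenium server is already running on localhost:4444, and the CSS selectors for the login form fields are guesses that would need checking against the live page:

    library(RSelenium)

    remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                          browserName = "firefox")
    remDr$open()

    ## Log in once; the browser session keeps the cookies afterwards.
    ## The selectors below are assumptions, not verified against the site.
    remDr$navigate("https://www.geocaching.com/account/login")
    remDr$findElement("css", "#UsernameOrEmail")$sendKeysToElement(list("your_username"))
    remDr$findElement("css", "#Password")$sendKeysToElement(list("your_password"))
    remDr$findElement("css", "button[type='submit']")$clickElement()

    ## Then fetch each cache page in the logged-in session and save its source
    mapply(function(url, file) {
      remDr$navigate(url)
      writeLines(remDr$getPageSource()[[1]], file)
    }, geocache_link_list, geocache_name_list)

    remDr$close()

This keeps the mapply() pattern from the question, but every navigate() happens inside the authenticated browser session.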


Solution

  • It's easy!

    library(RCurl)
    library(xml2)
    
    ## collect the name/value pairs of a form's <input> elements
    html_inputs <- function(p, xpath = "//form/input") {
      xml_find_all(p, xpath) %>% {setNames(as.list(xml_attr(., "value")), xml_attr(., "name"))}
    }
    get_header <- function(){
      ## RCurl setup: with a cookie file attached, the handle stays logged in
      myHttpheader<- c(
        "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71",
        # "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language" = "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        # "Accept-Encoding"="gzip, deflate",
        "Connection"="keep-alive",
        DNT = 1, 
        "Upgrade-Insecure-Requests" = 1, 
        "Host" = "www.geocaching.com")
    
      file_cookie <- "cookies.txt"
    
      ch <- getCurlHandle(# cainfo="pem/cacert.pem",
        # ssl.verifyhost=FALSE, ssl.verifypeer = FALSE,
        followlocation = TRUE,
        verbose = TRUE, 
        cookiejar = file_cookie, cookiefile = file_cookie,
        httpheader = myHttpheader) # handle now carries headers + cookie jar
      tmp <- curlSetOpt(curl = ch)
      return(ch)
    }
    ch <- get_header()
    h <- basicHeaderGatherer()
    
    # enter your username and password here
    user <- "kongdd"
    pwd <- "****"
    p <- getURL("https://www.geocaching.com/account/login", curl = ch)
    token <- html_inputs(read_html(p))[1] # hidden request-verification token
    params <- list(Username = user,
                   Password = pwd) %>% c(., token)
    p2 <- postForm("https://www.geocaching.com/account/login", curl = ch,
             .params = params)
    
    grep("kongdd", p2) # if 1 is returned, the login succeeded
    

    After logging in successfully, you can request further pages with the same curl = ch handle and they will be served as the logged-in version.
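    For example, the original mapply() download loop from the question can reuse the authenticated handle (a sketch; getURL() with the shared curl handle sends the stored session cookies on every request):

        ## Bulk download with the logged-in handle `ch` from above
        ## (geocache_link_list / geocache_name_list as in the question)
        mapply(function(url, file) {
          html <- getURL(url, curl = ch) # same handle, so same session cookies
          writeLines(html, file)
        }, geocache_link_list, geocache_name_list)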