
R web scraping: RCurl and httr return different content


I'm learning web scraping and have a question about two packages, httr and RCurl. I'm trying to extract a journal's ISSN from its page on the ResearchGate website. When I fetch the page with RCurl, I get the ISSN, but the same extraction on the content fetched with httr returns NULL. Can anyone tell me why? I expected both approaches to work. The code is below.

library(rvest)
library(httr)
library(RCurl)
library(stringr)  # provides str_to_title() and str_split() used below

url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"

########
# httr #
########

conexao <- GET(url)
conexao_status <- http_status(conexao)
conexao_status

webpage1 <- content(conexao, as = "text", encoding = "utf-8") %>% read_html()

ISSN <- webpage1 %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist
ISSN

########
# RCurl #
########

options(RCurlOptions = list(verbose = FALSE, 
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), 
                            ssl.verifypeer = FALSE))

webpage <- getURLContent(url) %>% read_html()

ISSN <- webpage %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist
ISSN

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] testit_0.7      dplyr_0.7.4     progress_1.1.2  readxl_1.1.0    stringr_1.3.0   RCurl_1.95-4.10 bitops_1.0-6
 [8] httr_1.3.1      rvest_0.3.2     xml2_1.2.0      jsonlite_1.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16      bindr_0.1.1       magrittr_1.5      R6_2.2.2          rlang_0.2.0       tools_3.5.0
 [7] yaml_2.1.19       assertthat_0.2.0  tibble_1.4.2      bindrcpp_0.2.2    curl_3.2          glue_1.2.0
[13] stringi_1.1.7     pillar_1.2.2      compiler_3.5.0    cellranger_1.1.0  prettyunits_1.0.2 pkgconfig_2.0.1


Solution

  • The httr response has a JSON content type, not HTML, so read_html() does not give you the page structure your XPath expects:

    > conexao
    Response [https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics]
    Date: 2018-06-02 03:15
    Status: 200
    Content-Type: application/json; charset=utf-8
    Size: 328 kB
    

    Use fromJSON() instead to extract the ISSN:

    library(jsonlite)
    # Parse the response body as JSON rather than HTML
    result <- fromJSON(content(conexao, as = "text", encoding = "utf-8"))
    # Walk the nested list down to the ISSN field
    result$result$data$journalFullInfo$data$issn
    

    result:

    > result$result$data$journalFullInfo$data$issn
    [1] "0730-0301"
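
The same idea can be made a little more defensive by checking the content type before choosing a parser. The following is only a minimal sketch, not part of the original answer: parse_body() is a hypothetical helper name, and the JSON string used in the offline check is made up for illustration (it is not the real ResearchGate payload).

```r
library(jsonlite)
library(xml2)

# Hypothetical helper: pick a parser based on the MIME type of the response.
parse_body <- function(body_text, mime_type) {
  if (grepl("json", mime_type, fixed = TRUE)) {
    fromJSON(body_text)    # JSON -> nested R list
  } else if (grepl("html", mime_type, fixed = TRUE)) {
    read_html(body_text)   # HTML -> xml_document usable with XPath
  } else {
    stop("Unsupported content type: ", mime_type)
  }
}

# With the httr response from the question, the call would look like:
#   parse_body(content(conexao, as = "text", encoding = "UTF-8"),
#              http_type(conexao))

# Offline check using a made-up JSON body (an assumption for illustration).
demo <- parse_body('{"data": {"issn": "0730-0301"}}', "application/json")
demo$data$issn
```

If you would rather receive HTML from httr, passing accept("text/html") to GET() requests that type; whether this particular server honours the header is an assumption that has not been verified here.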