
Web scraping: capture links to references with R


I want to capture the links to references from an article on this page: https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es

I have tried this:

    library(rvest)
    library(dplyr)
    
    link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
    page <- read_html(link)
    
    links <- page %>% 
        html_nodes("a") %>%
        html_text()

But these are not the links that I want.

There are 68 references, so I want the 68 links attached to those references.


Solution

  • I have been looking at the site and found that the [ links ] labels run some JavaScript on the onclick event that sends you to an intermediate page, so it is not easy to scrape the targets directly from the anchors (a quick way to check this is sketched right below). I found this solution, which matches 65 of the 68 links written as plain text in the "#article-back" section. It seems three links are not well formatted and therefore not matched (i.e. "h ttp://"). I hope it has been helpful.
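
    A quick way to check this is to compare the href and onclick attributes of the reference anchors; per the note above, the target is assembled in the onclick handler rather than in a plain href (a minimal, self-contained sketch):

    library(rvest)
    library(dplyr)
    
    link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
    page <- read_html(link)
    anchors <- page %>% html_node("#article-back") %>% html_nodes("a")
    
    head(html_attr(anchors, "href"))     # static targets of the anchors
    head(html_attr(anchors, "onclick"))  # onclick JavaScript, where the intermediate link is built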

    Edit: Regexp taken from this answer

    library(rvest)
    library(dplyr)
     
    link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
    page <- read_html(link)
    
    text <- page %>% html_node("#article-back") %>% 
        html_text()
    
     
    # match URLs written as plain text in the references section
    matches <- gregexpr(
      "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]",
      text)
    
    links <- regmatches(text, matches)
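
    gregexpr() returns one element per input string and regmatches() mirrors that structure, so the result above is a list; assuming the code above has run, a short follow-up to flatten it and check the count:

    links <- unique(unlist(links))  # flatten the one-element list returned by regmatches()
    length(links)                   # compare against the 65 matches mentioned above
    head(links)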
    

    Edit 2: To scrape the links from the JavaScript in the onclick attribute:

    library(rvest)
    library(dplyr)
     
    link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
    page <- read_html(link)
    
    text <- page %>% html_node("#article-back") %>% 
        html_nodes("a") %>% html_attr("onclick") 
    
    # extract the path from the onclick JavaScript and turn it into an absolute URL
    links <- gsub(".*(/[^']+).*", "https://www.scielo.org.mx\\1", text[!is.na(text)])
    
    # pull the pid parameter out of each intermediate link
    links_pid <- gsub(".*pid=([^&]+)&.*", "\\1", links)
    links_pid
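
    If you want to keep each intermediate link together with its pid, a small data frame works; whether the row count reaches the 68 references depends on how many anchors actually carry an onclick target (a sketch, assuming the code above has run):

    refs <- data.frame(link = links, pid = links_pid)
    nrow(refs)   # compare against the 68 references in the article
    head(refs)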