r rvest

Web scraping SciELO for the references of an article with rvest


I want to extract the references from an article on this page:

https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es

I have tried this:

library(rvest)
library(dplyr)
product_names = simple %>% 
  html_nodes(xpath= '//*[contains(concat( " ", @class, " " ), concat( " ", "references", " " ))]') %>%
  html_text()

but it did not work.

How can I extract the references?


Solution

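  • Here is one way: select the element with id "article-back", then take the text of the <p> elements inside it.
    The main complication is the presence of multi-byte characters at the end of each string, which is why the text is transliterated to ASCII with iconv() before trimming.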

    suppressPackageStartupMessages({
      library(rvest)
      library(dplyr)
    })
    
    link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
    page <- read_html(link)
    
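    page %>%
      # the reference list sits in the element with id "article-back"
      html_elements(xpath = '//*[@id="article-back"]') %>%
      html_elements("p") %>%
      html_text() %>%
      # strip newlines/tabs, square brackets and the "Links" label
      gsub("[\n\t]", "", .) %>%
      gsub("\\[|\\]", "", .) %>%
      gsub("Links", "", .) %>%
      # transliterate to ASCII (this also drops accents), then trim
      iconv(from = 'UTF-8', to = 'ASCII//TRANSLIT') %>%
      trimws() -> refs
    
    # keep only the reference entries (positions 3 to 70 for this article)
    refs <- refs[3:70]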
    
    head(refs)
    #> [1] "Alaie, S. A. (2020). Knowledge and learning in the horticultural innovation system: A case of Kashmir valley of India. International Journal of Innovation Studies, 4(1), 116-133. https://doi.org/10.1016/j.ijis.2020.06.002."                                                                  
    #> [2] "Andersson, U., Dasi, A., Mudambi, R., & Pedersen, T. (2016)Technology, innovation and knowledge: The importance of ideas and internationalconnectivity. Journal of World Business,51(1), 153-162.https://doi.org/10.1016/j.jwb.2015.08.017."                                                     
    #> [3] "Arroyo, F. J., Sanchez, J., & Sole, M. L. (2017). La calidad e innovacion como factores de diferenciacion para el comercio electronico de ropa interior de una marca latinoamericana en Espana. Contabilidad y Negocios, 12(23), 52-61. h ttps://doi.org/10.18800/contabilidad.201701.004."      
    #> [4] "Bach, H., Makitie, T., Hansen, T., & Steen, M. (2021). Blending new and old in sustainability transitions: Technological alignment between fossil fuels and biofuels in Norwegian coastal shipping. Energy Research & Social Science, 74(1), 101957. https://doi.org/10.1016/j.erss.2021.101957."
    #> [5] "Bodas, I. M., Marques, R. A.., & Silva, E. M. (2013). University-industry collaboration and innovation in emergent and mature industries in new industrialized countries. Research Policy, 42(2), 443-453. https://doi.org/10.1016/j.respol.2012.06.006."                                        
    #> [6] "Bourke, J., & Roper, S. (2017). Innovation, quality management and learning: Short-term and longer-term e?ects. Research Policy, 46(1), 1505-1518. https://doi.org/10.1016/j.respol.2017.07.005."
    

    Created on 2022-10-21 with reprex v2.0.2
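
A possible variation, sketched under stated assumptions rather than tested against the live page: if the trailing multi-byte characters are non-breaking spaces (an assumption), trimws() can remove them directly through its whitespace argument, which keeps the accents that iconv() strips. The hard-coded 3:70 can also be replaced by keeping only strings that contain a year in parentheses, which assumes APA-style entries. The HTML below is a made-up stand-in for the page, so the example runs offline.

```r
library(rvest)

# Hypothetical stand-in for the references block; the real markup may differ.
html <- '
<div id="article-back">
  <p>REFERENCIAS</p>
  <p>Alaie, S. A. (2020). Knowledge and learning. [ Links ]</p>
  <p>&nbsp;</p>
  <p>Bourke, J., &amp; Roper, S. (2017). Innovation and quality.&nbsp;[ Links ]</p>
</div>'

clean <- read_html(html) %>%
  html_elements("#article-back p") %>%
  html_text() %>%
  # drop the "[ Links ]" label that follows each reference
  gsub("\\[ ?Links ?\\]", "", .) %>%
  # "[\\h\\v]" also matches Unicode whitespace such as non-breaking spaces
  trimws(whitespace = "[\\h\\v]")

# Assumption: every reference contains a four-digit year in parentheses,
# so headings and empty paragraphs are filtered out without fixed indices.
refs <- clean[grepl("\\([0-9]{4}\\)", clean)]
```

On the real page, `read_html(html)` would be replaced by the `page` object created above. The year filter is only a heuristic and would miss entries without a year, such as undated reports.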