How to separate html_text results using rvest?


I am trying to scrape information from this Google Scholar web page:

https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science

library(rvest)

htmlfile<-"https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"

g_interest<- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int") %>% html_text()

I got the following result:

 [1] "Quantum Chemistry Electronic Structure Condensed Matter Physics Materials Science Nanotechnology "                   
 [2] "density functional theory first principles calculations many body theory condensed matter physics materials science "
 [3] "chemistry materials science physics nanotechnology "                                                                 
 [4] "Materials Science Nanotechnology Chemistry Physics "                                                                 
 [5] "Physics Theoretical Physics Condensed Matter Theory Materials Science Nanoscience "                                  
 [6] "Materials Science Quantum Chemistry Fiber Optic Sensors Geophysics "                                                 
 [7] "Chemical Physics Condensed Matter Materials Science Magnetic Properties NMR "                                        
 [8] "Materials Science "                                                                                                  
 [9] "Materials Science Physics "                                                                                          
[10] "Physics Materials Science Theoretical Physics Nanoscience "                                                          

However, I would like to get results like this:

[1]"Quantum Chemistry; Electronic Structure;Condensed Matter Physics; Materials Science; Nanotechnology " 
......

Any suggestions on how to separate the individual interests with ";"?


Solution

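  • html_text() on div.gsc_oai_int flattens each author's interests into one string, so the boundaries between them are lost. Select the div.gsc_oai_int nodes first, then extract the individual .gsc_oai_one_int nodes inside each one and collapse their text with ";". The purrr and stringr packages keep this short.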

    library(rvest)
    library(purrr)
    library(stringr)
    
    htmlfile<-"https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"
    
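    # One node per author; each contains that author's interest labels
    content_nodes <- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int")
    
    # For each author, extract the individual labels and join them with ";"
    map_chr(content_nodes, ~ .x %>%
            html_nodes(".gsc_oai_one_int") %>%
            html_text() %>%
            str_c(collapse = ";"))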
    

    Result:

    [1] "Quantum Chemistry;Electronic Structure;Condensed Matter Physics;Materials Science;Nanotechnology"                   
    [2] "density functional theory;first principles calculations;many body theory;condensed matter physics;materials science"
    [3] "chemistry;materials science;physics;nanotechnology"                                                                 
    [4] "Materials Science;Nanotechnology;Chemistry;Physics"                                                                 
    [5] "Physics;Theoretical Physics;Condensed Matter Theory;Materials Science;Nanoscience"                                  
    [6] "Materials Science;Quantum Chemistry;Fiber Optic Sensors;Geophysics"                                                 
    [7] "Chemical Physics;Condensed Matter;Materials Science;Magnetic Properties;NMR"                                        
    [8] "Materials Science"                                                                                                  
    [9] "Materials Science;Physics"                                                                                          
   [10] "Physics;Materials Science;Theoretical Physics;Nanoscience"
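
If the live page is unavailable or its markup has changed, the same idea can be checked offline. The sketch below is only an illustration: the inline HTML is a made-up stand-in that assumes the two class names used above (gsc_oai_int and gsc_oai_one_int), and it uses base R's vapply() and paste() instead of purrr and stringr, with "; " as the separator.

```r
library(rvest)

# Hypothetical stand-in for the page: two author entries that reuse the
# class names from the selectors above (not the real Google Scholar markup).
html <- '<html><body>
<div class="gsc_oai_int"><a class="gsc_oai_one_int">Quantum Chemistry</a> <a class="gsc_oai_one_int">Materials Science</a> </div>
<div class="gsc_oai_int"><a class="gsc_oai_one_int">Physics</a> </div>
</body></html>'

page <- read_html(html)

# One node per author
author_nodes <- html_nodes(page, "div.gsc_oai_int")

# For each author, take the text of every interest label and join with "; "
interests <- vapply(
  seq_along(author_nodes),
  function(i) {
    labels <- html_text(html_nodes(author_nodes[[i]], ".gsc_oai_one_int"))
    paste(labels, collapse = "; ")
  },
  character(1)
)

print(interests)
# [1] "Quantum Chemistry; Materials Science" "Physics"
```

Changing the collapse argument is all it takes to switch between ";" and "; ".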