I can't access sub-nodes for web scraping


I am trying to scrape players' information from a website using the following code:

# install required packages
if (!require(pacman)) install.packages("pacman")
pacman::p_load('rvest', 'stringi', 'dplyr', 'tidyr', 'measurements', 'reshape2', 'foreach', 'doParallel', 'raster', 'curl', 'httr', 'Iso')

profile_detail <- read_html('https://www.pgatour.com/players/player.01006.john-adams.html#profile') %>%
  html_node("[class='s-header__bottom']") %>%
  html_children()

But this code does not give me the desired result. Instead, I get only one node:

[1] <div class="s-header__no-data">No additional profile information available</div>

I am not sure how to access the div elements with the class 's-col'.

Here is a snippet of the player information I want to extract:

Can anyone help me with this please?

Thanks in advance!


Solution

  • You could use the CSS selector div.s-col in html_nodes():

    library(rvest)
    url <- 'https://www.pgatour.com/players/player.06197.michael-allen.html'
    
    url %>%
      read_html() %>%
      html_nodes('div.s-col') %>%
      html_text() %>%
      gsub('\\h+', ' ', ., perl = TRUE) %>%
      cat
    

    I am not sure what you want your final output to look like, but this returns:

     #Michael Allen 
     #Full Name
     
     
     #6 ft, 0 in
     #183 cm
     #Height
     
     
     #195 lbs
     #89 kg
     #Weight
     
     
     #January 31, 1959
     #Birthday
     
     
     #61
     #AGE
     
     
     #San Mateo, California
     #Birthplace
     
     
     #Scottsdale, Arizona
     #Residence
     
     
     #Wife, Cynthia; Christy (12/8/93), Michelle (6/3/97)
     #Family
     
     
     #University of Nevada (1982, Horticulture) 
     #College
     
     
     #1984
     #Turned Pro
     
     
     #16,963,593
     #Career Earnings
     
     
     #Paradise Valley, AZ, United States
     #City Plays From
     
    
    

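    Note that some players don't have personal information on the page, which likely explains the "No additional profile information available" node returned for the player in the question.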
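
Pages without profile data could be detected before extracting anything. The sketch below is an illustration, not part of the original answer: the inline HTML is a hypothetical stand-in built around the s-header__no-data node shown in the question, and get_profile_text is a hypothetical helper name.

```r
library(rvest)

# Hypothetical stand-in for a player page without profile data; the
# 's-header__no-data' class is taken from the node shown in the question.
page <- read_html('
<div class="s-header__bottom">
  <div class="s-header__no-data">No additional profile information available</div>
</div>')

# Return the trimmed text of every div.s-col, or NULL when the page has none.
get_profile_text <- function(page) {
  cols <- html_nodes(page, "div.s-col")
  if (length(cols) == 0) return(NULL)
  trimws(html_text(cols))
}

result <- get_profile_text(page)
if (is.null(result)) message("No profile information on this page")
```

Returning NULL rather than an empty string makes it easy to skip such players when looping over many pages.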