Extracting only certain nodes in web scraping on R


I am trying to extract some soccer data from fbref.com. Specifically, I need to extract some dates, and I would like to understand how to filter the nodes on the page so that I get only a certain type of data. Here is the HTML in question:

<th scope="row" class="left " data-stat="date" csk="20230819"><a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a></th>
    <a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a>

I read the page with this code:

library(rvest)

url <- "https://fbref.com/it/squadre/d609edc0/Statistiche-Internazionale"

html_data <- read_html(url)

html_data %>%
  html_nodes(".left ")

This returns roughly 1266 nodes, but I am only interested in the ones where data-stat="date". With only those nodes, I should be able to extract the date shown in the link (the text of the <a> element that carries the href).


Solution

  • Use html_attr() to extract the value of the data-stat attribute from each node, then keep only the nodes where that value is "date".

    library(rvest)
    url <- "https://fbref.com/it/squadre/d609edc0/Statistiche-Internazionale"
    
    html_data <- read_html(url)
    
    foundnodes <- html_data %>% html_nodes(".left ") 
    
    #extract the attribute and check to see if it is equal to date
    nodes_date <- which(html_attr(foundnodes, "data-stat")=="date")
    
    #subset the foundnodes
    foundnodes[nodes_date]
    

    Or do it in one statement with:

    #find the nodes with class = .left then check that attribute "data-stat" is equal to "date"    
    html_data  %>% html_elements(".left[data-stat='date']")
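
Once the date nodes are selected, the date text and the link can be pulled out of them. The following is a minimal sketch, not a tested scraper for the live site: it parses an inline HTML string modeled on the snippet in the question instead of fetching the page, and the variable names (date_nodes, date_text, match_href, match_date) and the extra "comp" cell are made up for illustration. It assumes the dates use the day-month-year format shown in the question.

```r
library(rvest)

# Inline HTML standing in for the live page (assumption: the real page
# uses the same markup as the snippet in the question).
html_data <- read_html('
<table>
  <tr>
    <th scope="row" class="left " data-stat="date" csk="20230819"><a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a></th>
    <td class="left " data-stat="comp">Serie A</td>
  </tr>
</table>')

# Keep only the .left nodes whose data-stat attribute equals "date"
date_nodes <- html_data %>% html_elements(".left[data-stat='date']")

# Text and href of the link inside each date node
date_text  <- date_nodes %>% html_element("a") %>% html_text2()
match_href <- date_nodes %>% html_element("a") %>% html_attr("href")

# Convert the dd-mm-yyyy text to a Date
match_date <- as.Date(date_text, format = "%d-%m-%Y")

# The csk attribute in the snippet holds the same date as yyyymmdd
csk_date <- as.Date(html_attr(date_nodes, "csk"), format = "%Y%m%d")
```

Using html_element() (singular) on the node set returns one result per date node, so date_text and match_href stay aligned with date_nodes even if a row has no link (that entry would be NA).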