I am trying to extract some soccer data from fbref.com, specifically some dates, and I would like to understand how to filter the nodes on the page.
Hello, I would like to extract some data from fbref, but I cannot manage to select only a certain type of element. Here is the HTML in question:
<th scope="row" class="left " data-stat="date" csk="20230819"><a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a></th>
<a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a>
This is the code I use to read the page:
library(rvest)
url <- "https://fbref.com/it/squadre/d609edc0/Statistiche-Internazionale"
html_data <- read_html(url)
html_data %>%
  html_nodes(".left ")
It returns roughly 1266 nodes, but I am only interested in the ones where data-stat="date". With only those nodes, I should then be able to extract the date from the link (the <a href="..."> element) inside each one.
One can use the html_attr() function to extract the value of the attribute and check whether it is the desired value.
library(rvest)
url <- "https://fbref.com/it/squadre/d609edc0/Statistiche-Internazionale"
html_data <- read_html(url)
foundnodes <- html_data %>% html_nodes(".left ")
#extract the attribute and check to see if it is equal to date
nodes_date <- which(html_attr(foundnodes, "data-stat")=="date")
#subset the foundnodes
foundnodes[nodes_date]
Or do it in one statement with:
#find the nodes with class = .left then check that attribute "data-stat" is equal to "date"
html_data %>% html_elements(".left[data-stat='date']")
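Once the date cells are selected, the date text and the link can be pulled from the <a> element inside each one. The following is a minimal sketch that runs offline against an inline copy of the HTML from the question rather than the live page. The second cell (data-stat="comp") and the variable names date_links, match_dates and match_urls are made up for illustration, and the live page may be structured differently.

```r
library(rvest)

# Inline copy of the header cell from the question, so the sketch runs offline.
# The <td> with data-stat="comp" is a hypothetical neighbour cell, added only
# to show that the selector skips non-date cells.
html_data <- minimal_html('
  <table><tr>
    <th scope="row" class="left " data-stat="date" csk="20230819"><a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a></th>
    <td class="left " data-stat="comp">Serie A</td>
  </tr></table>
')

# Select the <a> elements nested inside the date cells
date_links <- html_data %>% html_elements(".left[data-stat='date'] a")

# The link text holds the date in day-month-year order
match_dates <- as.Date(html_text2(date_links), format = "%d-%m-%Y")

# The href attribute holds the relative URL of the match page
match_urls <- html_attr(date_links, "href")
```

Against the real page, replacing minimal_html(...) with read_html(url) should give one date and one link per match row, provided the cells are present in the HTML that read_html() receives.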