r  web-scraping  rvest  htmlelements

Extracting repeated class with rvest html_elements in R


How are you? I am trying to extract some information from this sports betting webpage using rvest. I asked a related question a few days ago and reached almost all of my goals. So far, thanks to you, I have successfully extracted the title, the score, and the time of the matches being played using the following code:

library(rvest)
library(tidyverse)

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
  read_html()

data <- data.frame(
  Titulo = page %>%
    html_elements(".titulo") %>%
    html_text(),
  Marcador = page %>%
    html_elements(".marcador") %>%
    html_text(),
  Tiempo = page %>%
    html_elements(".marcador+ span") %>%
    html_text() %>%
    str_squish()
)

Now I want to get repeated values. For example, if the country of a match is "Brasil", I want the data frame to show Brasil as the country for every match in that category. So far I have only managed to extract all the countries as a separate list, not attached to their matches. The same applies to the sport name and the tournament.

Can you help me with that? Thanks in advance.


Solution

  • You could rewrite your code as separate functions, each working at a different level of the page. Calling them in a nested fashion makes the code easier to read.

    Essentially, nested map_dfr() calls combine the results of functions that work on node lists at different levels of the DOM into a single data frame.

    Below, think of it as an outer loop over sports, an intermediate loop over countries, and an innermost loop over the events within each sport and country.

    library(rvest)
    library(tidyverse)
    
    # Outer level: one sport node. Collects the rows of every category
    # inside it, then labels all of those rows with the sport name.
    get_sport_info <- function(sport) {
      df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
      df$sport <- sport %>%
        html_element(".sport-name") %>%
        html_text()
      return(df)
    }
    
    
    # Intermediate level: one category (country) node. Builds one row per
    # event, then labels all of those rows with the category name.
    get_play_info <- function(play) {
      df <- map_dfr(play %>% html_elements(".event"), ~
        data.frame(
          titulo = .x %>% html_element(".titulo") %>% html_text(),
          marcador = .x %>% html_element(".marcador") %>% html_text(),
          tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
        ))
      df$country <- play %>%
        html_element(".category-name") %>%
        html_text()
      return(df)
    }
    
    
    # Read the page, split it into sport nodes, and bind everything together.
    page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
    sports <- page %>% html_elements(".sport")
    final <- map_dfr(sports, get_sport_info)
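
Because the live page changes constantly, it can help to try the same nested pattern on a small inline document first. The sketch below is only an illustration: the HTML is invented (the class names follow the selectors used above, but the real page's nesting is an assumption), and the helper names get_category_rows and get_sport_rows are hypothetical.

```r
library(rvest)
library(tidyverse)

# Hypothetical markup: class names follow the selectors used above, but the
# nesting and every value here are invented for illustration.
html <- '
<div class="sport">
  <span class="sport-name">Futbol</span>
  <div class="category">
    <span class="category-name">Brasil</span>
    <div class="event">
      <span class="titulo">Team A vs Team B</span>
      <span class="marcador">1-0</span>
      <span> 45:00 </span>
    </div>
    <div class="event">
      <span class="titulo">Team C vs Team D</span>
      <span class="marcador">0-0</span>
      <span> 12:30 </span>
    </div>
  </div>
  <div class="category">
    <span class="category-name">Uruguay</span>
    <div class="event">
      <span class="titulo">Team E vs Team F</span>
      <span class="marcador">2-2</span>
      <span> 80:10 </span>
    </div>
  </div>
</div>
<div class="sport">
  <span class="sport-name">Tenis</span>
  <div class="category">
    <span class="category-name">ATP</span>
    <div class="event">
      <span class="titulo">Player G vs Player H</span>
      <span class="marcador">1-1</span>
      <span> Set 3 </span>
    </div>
  </div>
</div>'

page <- minimal_html(html)

# Innermost and intermediate level: one row per event inside a category,
# with the category name repeated on every row (hypothetical helper).
get_category_rows <- function(category) {
  df <- map_dfr(html_elements(category, ".event"), function(event) {
    data.frame(
      titulo = event %>% html_element(".titulo") %>% html_text(),
      marcador = event %>% html_element(".marcador") %>% html_text(),
      tiempo = event %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    )
  })
  df$country <- category %>% html_element(".category-name") %>% html_text()
  df
}

# Outer level: bind all categories of a sport and repeat the sport name
# on every row (hypothetical helper).
get_sport_rows <- function(sport) {
  df <- map_dfr(html_elements(sport, ".category"), get_category_rows)
  df$sport <- sport %>% html_element(".sport-name") %>% html_text()
  df
}

final <- map_dfr(html_elements(page, ".sport"), get_sport_rows)
print(final)
```

Inside each event, html_element() (singular) returns exactly one result per node and gives NA when nothing matches, so the columns stay aligned even if an event is missing a score or a time. Assigning the single country or sport string to a column recycles it across every row of that group, which is what produces the repeated values.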