Tags: r, web-scraping, rvest, html, text

I get different HTML text from what is shown on the web when scraping with rvest in RStudio


I'm trying to build a calendar (a data frame) of upcoming soccer matches. I'm scraping the columns one by one because I don't need them all. When I scrape the column with the kickoff time (HORA), I get values that are incorrect, and I don't know why. I don't think it has to do with the timezone, because it's just text.

library(rvest)
url <- "https://www.cruzados.cl/competitions/campeonato-nacional"
page <- read_html(url)

hora_inicio <- page %>% html_nodes("td.team-schedule__time") %>% html_text()

> hora_inicio
[1] "21:00" "22:30" "23:15" "22:30" "00:30" "00:00" "02:00" "02:00" "19:00" "22:00" "19:00" "22:15" "19:00" "02:00"
[15] "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00" "19:00"
[29] "19:00" "19:00" "19:00" "19:00" "19:00" "20:00" "20:00" "20:00" "20:00" "20:00" "20:00"

The right ones are: 18:00, 19:30, 19:15, 18:30, 20:30, 20:00, 18:00, ...


Solution

  • The times in the HTML returned by read_html() are in UTC. JavaScript on the page then converts them to the visitor's timezone, which is why the browser shows different values.

    The following extracts the dates and times, combines them, and converts the UTC datetimes to your current timezone:

    library(rvest)
    
    url <- "https://www.cruzados.cl/competitions/campeonato-nacional"
    page <- read_html(url)
    
    Sys.setlocale(locale="es_ES.UTF-8")
    
    date <- page %>% html_nodes("td.team-schedule__date") %>% html_text()
    time <- page %>% html_nodes("td.team-schedule__time") %>% html_text()
    
    # The page writes "sept." where the locale expects "sep."
    dates <- as.Date(gsub("sept", "sep", date), format="%a. %d / %b. / %Y") # e.g. "dom. 21 / mar. / 2021"
    
    # Parse date + time as UTC (vectorized, no loop needed),
    # then convert to the timezone of the current session
    utcDates <- as.POSIXct(paste(format(dates, "%Y-%m-%d"), time), format="%Y-%m-%d %H:%M", tz = "UTC")
    tzDates <- as.POSIXlt(utcDates, tz = Sys.timezone())
    print(tzDates)
    

    You will need the es_ES.UTF-8 or es_CL.UTF-8 locale installed so that the abbreviated Spanish month and weekday names can be parsed.
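
If installing a Spanish locale is not an option, one possible workaround is to parse the date text without relying on the locale at all. The sketch below is not part of the original answer: `parse_spanish_date` is a hypothetical helper, and it assumes the scraped strings look like "dom. 21 / mar. / 2021" with the usual Spanish month abbreviations.

```r
# Hypothetical helper: parse "dom. 21 / mar. / 2021" without a Spanish locale.
# Assumes the format "<weekday>. <day> / <month abbr>. / <year>".
parse_spanish_date <- function(x) {
  # Spanish month abbreviations mapped to month numbers ("sept" is an accepted variant)
  months <- c(ene = 1L, feb = 2L, mar = 3L, abr = 4L, may = 5L, jun = 6L,
              jul = 7L, ago = 8L, sep = 9L, sept = 9L, oct = 10L, nov = 11L, dic = 12L)

  # Capture day, month abbreviation and year; the weekday is ignored
  pattern <- "([0-9]{1,2}) / ([[:alpha:]]+)\\.? / ([0-9]{4})"
  parts <- regmatches(x, regexec(pattern, x))

  day   <- as.integer(vapply(parts, `[`, character(1), 2))
  month <- unname(months[tolower(vapply(parts, `[`, character(1), 3))])
  year  <- vapply(parts, `[`, character(1), 4)

  # Strings that do not match the pattern end up as NA
  as.Date(sprintf("%s-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

parsed <- parse_spanish_date(c("dom. 21 / mar. / 2021", "dom. 5 / sept. / 2021"))
print(parsed)
```

With this approach the `Sys.setlocale()` call and the `gsub("sept", "sep", ...)` step would no longer be needed, at the cost of maintaining the month table by hand.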


    In my case, I'm located in France, so you can see the offset change on 28 March from UTC+1 (CET) to UTC+2 (CEST):

    [1] "2021-03-21 22:00:00 CET"
    [2] "2021-03-29 00:30:00 CEST"
    [3] "2021-04-05 01:15:00 CEST"
    

    And the HTML returned is (UTC):

    [1] "2021-03-21 21:00"
    [2] "2021-03-28 22:30"
    [3] "2021-04-04 23:15"
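
To check this against the times the question expects, the same three UTC values can be converted to Chilean time. This is a minimal sketch that hard-codes the values shown above instead of scraping them, and it assumes the asker's timezone is "America/Santiago" (the question does not state it) and that the system's timezone database is available.

```r
# UTC datetimes as returned in the HTML (copied from the output above)
utc_strings <- c("2021-03-21 21:00", "2021-03-28 22:30", "2021-04-04 23:15")

# Parse the text as UTC instants
utc <- as.POSIXct(utc_strings, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Render the same instants in Chilean local time (assumed timezone)
santiago_times <- format(utc, format = "%H:%M", tz = "America/Santiago")
print(santiago_times)
# [1] "18:00" "19:30" "19:15"
```

These match the first three values the question lists as correct, which supports the explanation that the scraped text is UTC and only the display differs.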