r  web-scraping  xpath  rvest  non-ascii-characters

rvest html_node not returning a node when the id in the XPath has an accent


I am trying to scrape a table from an HTML page using rvest in R, but html_node does not find it. I suspect this is because the id in the XPath is in Spanish and contains an accented character.

Here is the code:

library(rvest)
library(xml2)

url <- "https://www3.ine.gub.uy/boletin/Boletin%20Ingresos%204to%20trimestre%202021.html"
html <- read_html(url)
data <- html_node(html, xpath='//*[@id="ingreso-medio-per-cápita"]/table/tbody')

I have searched a lot but cannot find a solution.
I would really appreciate it if someone could help me!


Solution

  • I'm not sure what causes the problem, but since the part of the id before the accented character is still unique on the page, you can match it with the XPath function starts-with:

    library(rvest)
    library(xml2)
    
    url <- "https://www3.ine.gub.uy/boletin/Boletin%20Ingresos%204to%20trimestre%202021.html"
    html <- read_html(url)
    
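    # Match the div whose id begins with the text before the accented character
    xpath <- '//div[starts-with(@id,"ingreso-medio-per-c")]/table'
    # Parse the first matching table and keep its first three rows
    data <- html_table(html_nodes(html, xpath = xpath))[[1]][1:3,]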
    #> Warning in table_fill(cells, trim = trim): NAs introduced by coercion
    
    data
    #> # A tibble: 3 x 3
    #>   ``         `Trimestre 3 2021` `Trimestre 4 2021`
    #>   <chr>                   <dbl>              <dbl>
    #> 1 Total país               25.8               26.6
    #> 2 Montevideo               32.5               33.5
    #> 3 Interior                 21.5               22.3
    

    Created on 2022-02-15 by the reprex package (v2.0.1)
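
Since the cause of the failed exact match is not identified above, one way to investigate is to print the ids the parser actually sees and compare them with the one in the XPath. The sketch below is only an illustration and runs offline: the inline page, its column names, and the variables page, ids, and tbl are made up for the example, and it assumes rvest 1.0.0 or later (for minimal_html, html_element, and html_elements). It says nothing about how the real page encodes its id.

```r
library(rvest)

# Hypothetical inline page standing in for the real one, so the sketch runs
# offline. "\u00e1" is the accented "a" written as a Unicode escape.
page <- minimal_html('
  <div id="ingreso-medio-per-c\u00e1pita">
    <table>
      <tr><th>Region</th><th>Q4</th></tr>
      <tr><td>Montevideo</td><td>33.5</td></tr>
      <tr><td>Interior</td><td>22.3</td></tr>
    </table>
  </div>')

# List every id attribute in the parsed document, to check how the
# accented id is stored after parsing
ids <- html_attr(html_elements(page, xpath = "//*[@id]"), "id")
print(ids)

# Match on the unaccented prefix, as in the answer above
xpath <- '//div[starts-with(@id, "ingreso-medio-per-c")]/table'
tbl <- html_table(html_element(page, xpath = xpath))
print(tbl)
```

Running the same html_attr call on the html object read from the real URL would show whether the id there differs from the string typed into the XPath, for example in its encoding.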