
Especially hard to remove trailing whitespace from text with R


I extracted some text from a web page.

But it contains some whitespace or special characters that I cannot remove easily.

I tried this:

library(dplyr)
library(rvest)

url <- "http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1607-40412016000100014&lang=es"

page <- read_html(url)
references_without_end_spaces <- page %>%
  html_elements("p") %>%
  .[grepl("(Links)|(doi:)", as.character(.))] %>%
  html_text() %>%
  gsub("[\n\t\b]", "", .) %>%
  gsub("\\[.*Links.*\\]", "", .) %>%
  gsub("\\s|\\n", " ", .) %>%
  trimws("both", whitespace = "[ \t\r\n\b]")

references_without_end_spaces

But the whitespace at the end of the references remains.

How can I remove this whitespace?


Solution

  • The issue is that the HTML page contains many &nbsp; HTML entities, which stand for non-breaking spaces. When the text is extracted, these entities become literal non-breaking space characters, \xA0.

    Thus, you can simply add that character to the whitespace class passed to trimws:

    trimws("both", whitespace = "[ \xA0\t\r\n\b]")
    

    Or, if you want to support all Unicode whitespace:

    trimws("both", whitespace = "\\p{Z}+")
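To see both fixes in isolation, here is a minimal, self-contained sketch. It uses a made-up reference string rather than the scraped page, and writes the non-breaking space with its Unicode escape, \u00A0 (the same character the answer refers to as \xA0):

```r
# A trailing non-breaking space (\u00A0) followed by a regular space,
# mimicking what html_text() can return for &nbsp;-padded references.
# (This string is invented for illustration only.)
x <- "Perez, J. (2016). Some reference. doi:10.1234/abcd\u00A0 "

# The default whitespace class "[ \t\r\n]" leaves the NBSP behind:
trimws(x)

# Adding the non-breaking space to the class removes it:
trimws(x, "both", whitespace = "[ \u00A0\t\r\n]")

# Matching any Unicode separator character (\p{Z}) also works:
trimws(x, "both", whitespace = "\\p{Z}+")
```

Note that trimws() appends its own quantifier and anchors to the whitespace pattern, so passing either a character class or a \p{...} property expression works directly.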