I extracted some text from a web page.
But it contains some whitespace and special characters that I cannot remove easily.
I tried this:
library(dplyr)
library(rvest)
url <- "http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1607-40412016000100014&lang=es"
page <- read_html(url)
referenes_whitout_end_spaces <- page %>%
  html_elements("p") %>%
  .[grepl("(Links)|(doi:)", as.character(.))] %>%
  html_text() %>%
  gsub("[\n\t\b]", "", .) %>%
  gsub("\\[.*Links.*\\]", "", .) %>%
  gsub("\\s|\\n", " ", .) %>%
  trimws("both", whitespace = "[ \t\r\n\b]")
referenes_whitout_end_spaces
But the whitespace at the end of the references remains.
How can I remove this whitespace?
The issue is that the HTML page contains a lot of HTML entities standing for non-breaking spaces. These entities are converted to literal non-breaking space characters (\u00A0, i.e. U+00A0), which the default trimws pattern "[ \t\r\n]" does not match.
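You can see this by inspecting the last character of one of the extracted strings; a minimal sketch with a hypothetical stand-in string:
# hypothetical stand-in for one extracted reference with a trailing non-breaking space
ref <- "Some reference text\u00A0"
utf8ToInt(substr(ref, nchar(ref), nchar(ref)))  # 160, i.e. U+00A0 (non-breaking space)
trimws(ref)                                     # default pattern leaves it in place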
Thus, you can simply add that character to the whitespace argument of trimws:
trimws("both", whitespace = "[ \xA0\t\r\n\b]")
Or, if you want to match any Unicode space separator (the \p{Z} character category, which also covers the non-breaking space):
trimws("both", whitespace = "\\p{Z}+")