
Especially hard to remove trailing whitespace from text with R


I extracted some text from a web page.

But it contains some whitespace or special characters that I cannot remove easily.

I tried this:

library(dplyr)
library(rvest)

url <- "http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1607-40412016000100014&lang=es"

page <- read_html(url)
references_without_end_spaces <- page %>%
  html_elements("p") %>%
  .[grepl("(Links)|(doi:)", as.character(.))] %>%
  html_text() %>%
  gsub("[\n\t\b]", "", .) %>%
  gsub("\\[.*Links.*\\]", "", .) %>%
  gsub("\\s|\\n", " ", .) %>%
  trimws("both", whitespace = "[ \t\r\n\b]")

references_without_end_spaces

But the whitespace at the end of the references remains.

How can I remove this whitespace?


Solution

  • The issue is that the HTML page contains many &nbsp; HTML entities, which stand for non-breaking spaces. When the text is extracted, these entities become literal non-breaking space characters, \xA0.

    Thus, you can simply add that character to the whitespace class passed to trimws:

    trimws("both", whitespace = "[ \xA0\t\r\n\b]")
    

    Or, if you want to support all Unicode whitespace:

    trimws("both", whitespace = "\\p{Z}+")
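To see both fixes in isolation, here is a minimal, self-contained sketch. It uses a made-up reference string rather than the scraped page, and writes the non-breaking space with its Unicode escape, \u00A0 (the same character the answer refers to as \xA0):

```r
# A trailing non-breaking space (\u00A0) followed by a regular space,
# mimicking what html_text() can return for &nbsp;-padded references.
# (This string is invented for illustration only.)
x <- "Perez, J. (2016). Some reference. doi:10.1234/abcd\u00A0 "

# The default whitespace class "[ \t\r\n]" leaves the NBSP behind:
trimws(x)

# Adding the non-breaking space to the class removes it:
trimws(x, "both", whitespace = "[ \u00A0\t\r\n]")

# Matching any Unicode separator character (\p{Z}) also works:
trimws(x, "both", whitespace = "\\p{Z}+")
```

Note that trimws() appends its own quantifier and anchors to the whitespace pattern, so passing either a character class or a \p{...} property expression works directly.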