How to remove "ÂÂ" from scraped text in R?


After I scrape some text, "ÂÂ" is added after certain words and numbers in it. To remove the unwanted "ÂÂ", I ran a few gsub calls:

text <- gsub("Ã", " ", text)
text <- gsub("Â", " ", text)
text <- gsub(",", "", text)
text <- gsub("  ", " ", text)

This removes the special "A" characters, but the comma is not taken out.

What the text looks like before the gsub calls:

 ALBANY OFF REBOUND BY  #43 STIRE       #43 STIRE is not commented out

What the text looks like after the gsub calls:

 ALBANY ‚  OFF ‚  REBOUND BY #43 ‚  STIRE        #43 ‚  STIRE is not commented out

What I would like the text to look like:

 ALBANY OFF REBOUND BY #43 STIRE                 #43 STIRE is not commented out

Any help will be appreciated. Please let me know if any further information is needed.


Solution

  • You could use str_replace_all() from the stringr package:

    text <- "ALBANYÃ,Â OFFÃ,Â REBOUND BY"
    
    library(stringr)
    str_replace_all(text, "Ã,Â", "")
    #> [1] "ALBANY OFF REBOUND BY"
    

    Or with gsub:

    gsub("Ã,Â","",text)
    #> [1] "ALBANY OFF REBOUND BY"
    

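    However, I think this is an encoding issue in the first place. Moreover, the results of gsub or str_replace_all may differ with the encoding, which could be why your text <- gsub(",", "", text) does not work. It may also be that the character left in your output, "‚", is not an ASCII comma at all but the single low-9 quotation mark (U+201A), in which case the pattern "," would never match it.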

    You could check the encoding with Encoding(text).
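
To make the encoding check concrete, here is a minimal sketch in base R. The sample string is hypothetical: it is an assumption about what the scraped text contains, built with \u escapes so the result does not depend on how the script file itself is saved. It inspects the string with Encoding() and utf8ToInt(), tests whether an ASCII comma is really present, and then removes the three stray characters by code point.

```r
# Hypothetical sample (an assumption, not the asker's real data).
# \u00c3 = "Ã", \u201a = "‚" (single low-9 quotation mark), \u00c2 = "Â"
text <- "ALBANY\u00c3\u201a\u00c2 OFF\u00c3\u201a\u00c2 REBOUND BY"

# Declared encoding of the string
Encoding(text)

# Code points show which characters are really there (8218 is U+201A)
utf8ToInt(text)

# An ASCII comma is not in the string, so gsub(",", "", text) changes nothing
grepl(",", text, fixed = TRUE)
#> [1] FALSE

# Remove the three stray characters by code point, then tidy the spaces
clean <- gsub("[\u00c3\u201a\u00c2]", "", text)
clean <- trimws(gsub(" +", " ", clean))
clean
#> [1] "ALBANY OFF REBOUND BY"
```

If utf8ToInt() reports other code points for your real text (for example 160, a non-breaking space), add them to the character class in the same way. This only cleans up the symptom; reading the page with the correct encoding in the first place would avoid the stray characters altogether.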