
Fix text encoding in R


I am having an issue with text encoding that I cannot solve.

I have a string in an Excel file that I'm reading into R that looks like: Productâ„¢. With a bit of research, I learned that the â„¢ is the UTF-8 encoding of ™ that has been read incorrectly as CP-1252.

The UTF-8 hex code for ™ is 0xe2 0x84 0xa2. These bytes have been read as CP-1252: â (E2) „ (84) ¢ (A2).
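
The byte mapping above can be checked directly in R. This is only a minimal sketch: it assumes the platform's iconv recognizes the encoding name "CP1252", and the variable names are made up for illustration.

```r
# The trademark sign, written as a Unicode escape so the script does not
# depend on the encoding of the source file.
tm <- "\u2122"

# Its UTF-8 bytes.
tm_bytes <- charToRaw(enc2utf8(tm))
print(tm_bytes)  # bytes: e2 84 a2

# Reproduce the corruption: treat those same bytes as CP-1252 and
# convert them to UTF-8. Each byte becomes its own character.
broken <- iconv(enc2utf8(tm), from = "CP1252", to = "UTF-8")

# The result is the three characters a-circumflex (U+00E2),
# low double quote (U+201E) and cent sign (U+00A2).
print(charToRaw(broken))  # bytes: c3 a2 e2 80 9e c2 a2
```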

How can I fix this issue? I have tried using:

iconv("Productâ„¢", "cp1252", "utf-8")

#> [1] "Productâ„¢"

But as you can see, the output is incorrect. The desired output is Product™.

Any ideas about how to fix this issue? The incorrect data is in an Excel spreadsheet, but I am trying to clean the text in R. A solution to fix the original data or a data cleaning solution in R would be great.


Solution

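  • Update: I had the arguments backwards. It turns out the text was being read as UTF-8 when it really should have been treated as CP-1252. Converting from UTF-8 to CP-1252 turns the characters â„¢ back into the bytes 0xe2 0x84 0xa2, which are the UTF-8 encoding of ™. I was able to solve it by using: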

    iconv("Productâ„¢", "utf-8", "cp1252")
    
    #> [1] "Product™"
    

    Special thanks to @BalusC and this answer which showed me how to identify which encodings were being used erroneously.
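
For cleaning a whole column rather than a single string, the same conversion could be wrapped in a small helper. The sketch below is an assumption-laden illustration, not part of the original answer: fix_mojibake is a hypothetical name, and the rule "keep the original value unless the round trip yields valid UTF-8" is one possible safeguard so that text which was never corrupted is left alone. It relies on validUTF8(), available in R 3.3.0 and later, and on iconv accepting "CP1252".

```r
# Hypothetical helper: undo "UTF-8 bytes read as CP-1252" corruption.
fix_mojibake <- function(x) {
  # Map each character back to the single CP-1252 byte it came from.
  # Strings containing characters outside CP-1252 come back as NA.
  fixed <- iconv(x, from = "UTF-8", to = "CP1252")

  # The recovered bytes are really UTF-8, so mark them as such.
  Encoding(fixed) <- "UTF-8"

  # Accept a repaired value only if it converted and is valid UTF-8;
  # otherwise keep the original string untouched.
  ok <- !is.na(fixed) & validUTF8(fixed)
  x[ok] <- fixed[ok]
  x
}

# "Product" followed by U+00E2, U+201E, U+00A2 (the corrupted trademark
# sign), plus two strings that should pass through unchanged.
dirty <- c("Product\u00e2\u201e\u00a2", "Plain text", "caf\u00e9")
clean <- fix_mojibake(dirty)
print(clean)
```

The third element shows why the validity check is there: a correctly encoded é becomes the lone byte 0xe9 in CP-1252, which is not valid UTF-8, so the helper keeps the original string.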