Tags: r, diacritics, tm

Removing accents from Twitter text


I am doing text mining on Spanish tweets. My problem is that the same word appears in different forms (with and without an accent), for example: accion, acción.

I tried setting the encoding to Unicode ("UTF-8"), but it doesn't work. These are the libraries I load:

    library(stringi)
    library(twitteR)
    library(tm)
    library(wordcloud)
    library(RColorBrewer)


Solution

  • You did not say clearly what you want to do with the retrieved tweets (save them to a text file, keep them in a data frame, etc.). If you write them out with UTF-8 encoding, the accented letters are preserved as they are.

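     # df is assumed to be a data frame holding the tweets
     con <- file("C:/Dir1/sub_dir1/output/output.txt", encoding = "UTF-8")
     write.csv(df, file = con, row.names = FALSE)  # write() is for vectors/matrices; write.csv() handles data frames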
    

    However, if you want to replace the accented characters with their unaccented equivalents, the simplest way is to use iconv:

    iconv("acción", to = "ASCII//TRANSLIT")
    [1] "accion"
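
The result of `//TRANSLIT` in iconv depends on the platform's iconv implementation, so it may not behave the same everywhere. Since stringi is already loaded in the question, one alternative is `stri_trans_general()` with the ICU "Latin-ASCII" transform. The following is only a minimal sketch: the `tweets` vector is made up for illustration, and lowercasing is an extra step I am assuming you want so that "Acción" and "accion" end up as the same term.

```r
library(stringi)

# Hypothetical sample standing in for the text of the downloaded tweets
tweets <- c("La acción fue rápida", "la accion fue rapida", "Mañana más")

# Replace accented Latin letters with their closest ASCII equivalents
clean <- stri_trans_general(tweets, "Latin-ASCII")

# Lowercase so that differently cased variants collapse to one form
clean <- tolower(clean)

print(clean)
```

Note that this also turns "ñ" into "n", which can merge distinct Spanish words (for example "año" and "ano"), so check whether that is acceptable for your analysis. If the tweets are already in a tm corpus, the same function could probably be applied through `tm_map()` wrapped in `content_transformer()`, although I have not tested that against your data.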