How do I remove the characters below from tweets in an R data frame using regex?
அனà¯à®ªà¯à®®à¯ பாசமà¯à®®à¯ நிறைநà¯à®¤ இஸà¯à®²à®¾à®®à®¿à®¯ சகோதர சகோதரிகள௠கà¯à®•à¯ à®°à®®à¯à®œà®¾à®©à¯ நலà¯à®µà®¾à®´à¯à®¤à¯à®¤à¯à®•à¯à®•à®³à¯ …
Thanks in advance. :)
The answer goes out to Rushabh. You can use iconv, which converts strings from one encoding to another and replaces non-convertible characters with the value given in the argument sub:
foo <- "அனà¯à®ªà¯à®®à¯ பாசமà¯à®®à¯ நிறைநà¯à®¤ இஸà¯à®²à®¾à®®à®¿à®¯ சகோதர சகோதரிகள௠கà¯à®•à¯ à®°à®®à¯à®œà®¾à®©à¯ நலà¯à®µà®¾à®´à¯à®¤à¯à®¤à¯à®•à¯à®•à®³à¯ …"
iconv(foo, from = "UTF-8", to = "ASCII", sub = "")
Output:
[1] "aaaaaaa aaasaaaa aaaaaaa aaaaaaaa asaaaa asaaaaaaaa aaaa aaaaaaa aaaaaaaaaaaaaaaa a"
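Since the question asks about a data frame and about regex, here is a minimal sketch of both routes applied to a whole column. The data frame `tweets`, its column `text`, and the sample strings are made up for illustration; they are not from the question. The exact result of iconv can differ between platforms, so treat this as a starting point rather than a definitive recipe.

```r
# Hypothetical data frame standing in for the tweets; the column name `text`
# and the sample strings are assumptions, not taken from the question.
tweets <- data.frame(
  text = c("caf\u00e9 ok", "plain ascii", "\u0b85\u0ba9 hello"),
  stringsAsFactors = FALSE
)

# Option 1: iconv, as in the answer above, applied to the whole column.
# Characters that cannot be represented in ASCII are replaced by `sub`.
tweets$clean_iconv <- iconv(tweets$text, from = "UTF-8", to = "ASCII", sub = "")

# Option 2: a regex that drops every character outside the ASCII class.
# perl = TRUE is needed for the [:ascii:] character class.
tweets$clean_regex <- gsub("[^[:ascii:]]", "", tweets$text, perl = TRUE)

# Removing characters can leave stray spaces at the ends; trim them.
tweets$clean_regex <- trimws(tweets$clean_regex)
```

Both options remove every non-ASCII character, so legitimate accented letters are lost along with the garbled ones. If that matters for your tweets, a narrower character class in the gsub pattern may be a better fit.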