r regex tweets

How to remove special chars like ன from string in R using regex


How do I remove characters like the ones below from tweets in an R data frame using regex?

அனà¯à®ªà¯à®®à¯ பாசமà¯à®®à¯ நிறைநà¯à®¤ இஸà¯à®²à®¾à®®à®¿à®¯ சகோதர சகோதரிகள௠கà¯à®•à¯ à®°à®®à¯à®œà®¾à®©à¯ நலà¯à®µà®¾à®´à¯à®¤à¯à®¤à¯à®•à¯à®•à®³à¯ …

Thanks in advance. :)


Solution

  • Credit for this answer goes to Rushabh. You can use iconv, which converts strings from one encoding to another and replaces anything that cannot be converted with the value given in the sub argument:

    foo <- "அனà¯à®ªà¯à®®à¯ பாசமà¯à®®à¯ நிறைநà¯à®¤ இஸà¯à®²à®¾à®®à®¿à®¯ சகோதர சகோதரிகள௠கà¯à®•à¯ à®°à®®à¯à®œà®¾à®©à¯ நலà¯à®µà®¾à®´à¯à®¤à¯à®¤à¯à®•à¯à®•à®³à¯ …"
    iconv(foo, from = "UTF-8", to = "ASCII", sub = "")
    

    Output:

    [1] "aaaaaaa aaasaaaa aaaaaaa aaaaaaaa asaaaa asaaaaaaaa aaaa aaaaaaa aaaaaaaaaaaaaaaa a"
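
Since the question asks about a data frame and about regex, here is a minimal sketch of both approaches applied to a column. The data frame `tweets`, its column `text`, and the sample strings are hypothetical stand-ins, not the asker's data. The regex variant deletes every character outside the ASCII range, which is one possible reading of "special chars". The exact result of the iconv variant can vary with the platform's iconv implementation.

```r
# Hypothetical data frame standing in for the tweets (column name is an assumption).
tweets <- data.frame(
  text = c("Ramadan wishes \u0b85\u0ba9\u0bcd", "plain ASCII tweet"),
  stringsAsFactors = FALSE
)

# Option 1: iconv, replacing whatever cannot be converted to ASCII with "".
tweets$clean_iconv <- iconv(tweets$text, from = "UTF-8", to = "ASCII", sub = "")

# Option 2: regex, deleting every run of characters outside the ASCII range.
tweets$clean_regex <- gsub("[^\\x01-\\x7F]+", "", tweets$text, perl = TRUE)

# Trim the whitespace that the removed characters leave behind.
tweets$clean_regex <- trimws(tweets$clean_regex)

print(tweets$clean_regex)
```

Both options discard the non-ASCII text rather than repairing it, so they only fit when that text is not needed downstream.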