Removing non-ASCII characters from data files


I have a number of CSV files that I read into R and include in a package's data folder in .rdata format. Unfortunately, the non-ASCII characters in the data cause the package check to fail. The tools package has two functions for detecting non-ASCII characters (showNonASCII and showNonASCIIfile), but I can't find one that removes or cleans them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?


Solution

  • To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

    x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
    Encoding(x) <- "latin1"  # (just to make sure)
    x
    # [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
    
    iconv(x, "latin1", "ASCII", sub="")
    # [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"
    

    To locate non-ASCII characters, or to check whether your files contain any at all, you could adapt the following ideas:

    ## Do *any* lines contain non-ASCII characters?
    any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
    # [1] TRUE
    
    ## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
    grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
    # [1] 1 2 3
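
Since the question is about data read from CSV files, the same iconv() call can be applied to every character column of a data frame before saving it. The following is only a minimal sketch: strip_non_ascii() is a hypothetical helper name (not an existing function), the example data frame is made up, and it assumes the text is Latin-1 encoded, so adjust the from argument if your files use a different encoding.

```r
# Hypothetical helper (not part of any package): drop non-ASCII characters
# from every character column of a data frame.
# Assumption: the text is encoded as Latin-1; change `from` otherwise.
strip_non_ascii <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = "ASCII", sub = "")
  df
}

# Made-up example data standing in for a CSV file read with read.csv()
name <- c("Ekstr\xf8m", "J\xf6reskog")
Encoding(name) <- "latin1"
dat <- data.frame(name = name, n = 1:2, stringsAsFactors = FALSE)

clean <- strip_non_ascii(dat)
clean$name
# [1] "Ekstrm"  "Jreskog"
```

Non-character columns are left untouched, and factor columns would need to be converted with as.character() first. The cleaned data frame can then be passed to save() as usual.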