r

How to convert UTF-8 to ASCII for a list of data frames


I have a list of data frames, and each data frame has multiple variables. All variables are character vectors. I found that some variables contain special characters (such as \x96) that prevent my code from running. I have to convert or remove them. Is there a smart way to remove them all at once?

Currently I found that iconv(lst[["df1"]]$var1, from = "UTF-8", to = "ASCII", sub = "") can take care of one variable in a data frame. However, it is impractical for me to go through the whole list of data frames to find every variable that has this issue. Is there a way to loop or map this so that all "UTF-8" text is converted to ASCII?

I am open to other suggestions. I tried to use gsub but could not get it to work for \x96. If you happen to know how, please share.

I don't know how to create a dummy data list. The structure should look like this:

lst
  df1 (10 vars)
  df2 (5 vars)
  df3 (4 vars)

Solution

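  • Does this work for you? I don't know what your list looks like, so this is just a general approach. Use lapply to iterate over your list of data frames, lst, and a nested lapply to iterate over the columns of each data frame. Since you don't know which columns have this issue, you can add a small refinement: the condition !all(validUTF8(x)) is TRUE when at least one element of the column is not valid UTF-8, so only those columns are passed through iconv.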

    lst <- list(
      df1 = data.frame(
        var1 = c("text\x96here", "normal"),
        var2 = c("more\x96text", "clean"),
        stringsAsFactors = FALSE
      ),
      df2 = data.frame(
        var1 = c("some\x96where", "fine"),
        stringsAsFactors = FALSE
      )
    )
    
    lst_clean <- lapply(lst, function(df) {
      df[] <- lapply(df, function(x) {
        if (is.character(x) && !all(validUTF8(x))) {
          iconv(x, from = "UTF-8", to = "ASCII", sub = "")
        } else {
          x
        }
      })
      df
    })
    

    giving

    $df1
          var1     var2
    1 texthere moretext
    2   normal    clean
    
    $df2
           var1
    1 somewhere
    2      fine
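
  • Since the question also asks about gsub: a minimal sketch of a byte-wise alternative, assuming the only offending byte is \x96. The helper name strip_byte and its byte argument are hypothetical, introduced here for illustration. The idea is to match the byte literally (fixed = TRUE) and compare byte by byte (useBytes = TRUE), so that gsub does not complain about input that is not valid UTF-8.

```r
# Hypothetical helper (not from the original answer): remove one specific
# raw byte from every character column of a data frame.
strip_byte <- function(df, byte = "\x96") {
  df[] <- lapply(df, function(x) {
    if (is.character(x)) {
      # fixed = TRUE: treat the pattern as a literal string, not a regex.
      # useBytes = TRUE: match byte by byte, so invalid UTF-8 is tolerated.
      gsub(byte, "", x, fixed = TRUE, useBytes = TRUE)
    } else {
      x
    }
  })
  df
}

# Same kind of dummy data as above, shortened.
lst <- list(
  df1 = data.frame(var1 = c("text\x96here", "normal"), stringsAsFactors = FALSE),
  df2 = data.frame(var1 = c("some\x96where", "fine"), stringsAsFactors = FALSE)
)

# Apply the helper to every data frame in the list.
lst_gsub <- lapply(lst, strip_byte)
```

    This only removes the single byte passed in, so the iconv approach above remains the more general one when several different non-ASCII bytes may be present. If the data originally came from Windows-1252 text, where byte 0x96 is an en dash, converting from that encoding instead of deleting the byte may preserve more information, but that is an assumption about where the data came from.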