I have a list of data frames, and each df has multiple variables. All variables are character. I found that some variables contain special characters (such as \x96) that prevent my code from running. I have to convert or remove them. Is there a smart way to remove them all at once?
Currently I found that iconv(lst[["df1"]]$var1, from = "UTF-8", to = "ASCII", sub = "")
can take care of one variable in a df. However, it is impractical for me to go through the whole list of data frames to find every variable that has this issue. Is there a way to loop or map this so that all "UTF-8" text is converted to ASCII?
I am open to other suggestions. I tried to use gsub and could not get it to work for \x96. If you happen to know how, please kindly share.
I don't know how to create a dummy data list. The structure should look like:
lst
df1 (10 vars)
df2 (5 vars)
df3 (4 vars)
Does this work for you? I don't know what your list looks like, so this is just a general approach. Use lapply to iterate over your list of data frames lst. Since you don't know which columns have this issue, you can add a small improvement: !all(validUTF8(x)) checks whether a column contains any invalid UTF-8, so only the affected columns are converted.
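As a quick illustration of what that check returns on its own, here is a minimal sketch on a made-up vector x (hypothetical, not taken from your data):

```r
# Hypothetical example vector: the first element contains the raw byte 0x96,
# which is not valid UTF-8; the second is plain ASCII
x <- c("text\x96here", "normal")

# validUTF8() returns one logical per element
valid <- validUTF8(x)

# TRUE when at least one element is not valid UTF-8
needs_cleaning <- !all(valid)
```

Here valid is FALSE for the first element and TRUE for the second, so needs_cleaning is TRUE. The full approach on a small example list: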
lst <- list(
  df1 = data.frame(
    var1 = c("text\x96here", "normal"),
    var2 = c("more\x96text", "clean"),
    stringsAsFactors = FALSE
  ),
  df2 = data.frame(
    var1 = c("some\x96where", "fine"),
    stringsAsFactors = FALSE
  )
)
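lst_clean <- lapply(lst, function(df) {
  # df[] <- keeps the data frame structure while replacing every column
  df[] <- lapply(df, function(x) {
    # && short-circuits, so validUTF8() is only called on character columns
    if (is.character(x) && !all(validUTF8(x))) {
      # drop every byte that cannot be converted to ASCII
      iconv(x, from = "UTF-8", to = "ASCII", sub = "")
    } else {
      x
    }
  })
  df
})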
giving
$df1
      var1     var2
1 texthere moretext
2   normal    clean

$df2
       var1
1 somewhere
2      fine
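Two alternatives, in case deleting everything non-ASCII is more than you want. This is only a sketch under assumptions: the vector x is made up, and the second option assumes your text was originally saved as Windows-1252 (where byte 0x96 is an en dash), which I can't confirm from the question. The encoding name "windows-1252" is also an assumption about your platform; iconvlist() shows the names available on your machine.

```r
# Hypothetical example vector containing the raw byte 0x96
x <- c("text\x96here", "normal")

# Option 1: remove only the 0x96 byte with gsub.
# fixed = TRUE and useBytes = TRUE make gsub compare raw bytes
# instead of trying to interpret the string as valid UTF-8.
dropped <- gsub("\x96", "", x, fixed = TRUE, useBytes = TRUE)

# Option 2: assume the input is Windows-1252 and re-encode it to UTF-8,
# which keeps the character (an en dash) instead of deleting it.
recoded <- iconv(x, from = "windows-1252", to = "UTF-8")

# Applying option 2 to a whole list of data frames, touching only
# character columns that are not already valid UTF-8
lst_demo <- list(df1 = data.frame(var1 = x, stringsAsFactors = FALSE))
lst_recoded <- lapply(lst_demo, function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col) && !all(validUTF8(col))) {
      iconv(col, from = "windows-1252", to = "UTF-8")
    } else {
      col
    }
  })
  df
})
```

Keeping the validUTF8 guard matters for option 2: re-encoding a column that is already valid UTF-8 as if it were Windows-1252 would garble any accented characters in it.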