I'm working with a data frame that consists of multiple data types (numerics, characters, timestamps), but unfortunately all of them are received as characters. Hence I need to coerce them into their "appropriate" format dynamically and as efficiently as possible.
Consider the following example:
df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
I obviously want val1 to be numeric and val2 to remain a character. Therefore, my result should look like this:
'data.frame': 4 obs. of 2 variables:
$ val1: num 1 2 3 4
$ val2: chr "A" "B" "C" "D"
Right now I'm accomplishing this by checking whether the coercion would result in NA and coercing only if this isn't the case:
res <- as.data.frame(lapply(df, function(x){
  x <- sapply(x, function(y) {
    if (is.na(as.numeric(y))) {
      return(y)
    } else {
      y <- as.numeric(y)
      return(y)
    }
  })
  return(x)
}), stringsAsFactors = FALSE)
However, this doesn't strike me as the correct solution because of multiple issues:
I get the warning In FUN(X[[i]], ...) : NAs introduced by coercion, although this isn't the case (see result).
Is there a general, heuristic approach to this, or another, more sustainable solution? Thanks
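A note on the warning: as.numeric() emits "NAs introduced by coercion" whenever it meets a string it cannot parse, such as "A", so the per-element check triggers it even though the final result contains no NA. Below is a minimal base-R sketch of the same check done once per column with the warning suppressed. The helper name convert_if_numeric is hypothetical and not taken from any package:

```r
# Hypothetical helper: convert a character vector to numeric only if
# the coercion creates no NAs beyond those already present in the input.
convert_if_numeric <- function(x) {
  num <- suppressWarnings(as.numeric(x))  # silence the expected coercion warning
  if (all(is.na(num) == is.na(x))) num else x
}

df <- data.frame(val1 = c("1", "2", "3", "4"),
                 val2 = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

res <- df
res[] <- lapply(df, convert_if_numeric)  # res[] keeps the data.frame structure
str(res)
```

With this sketch val1 becomes numeric (double) and val2 stays character, and the work is done once per column instead of once per element.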
Recent file readers like data.table::fread or the readr package do a pretty decent job of identifying and converting columns to the appropriate type.
So my first reaction was to suggest writing the data to a file and reading it in again, e.g.,
library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
- attr(*, ".internal.selfref")=<externalptr>
or without actually writing to disk:
df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
However, d.b's suggestions are much smarter but need some polishing to avoid coercion to factor:
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
or
df[] <- lapply(df, readr::parse_guess)
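Since the question also mentions timestamps: type.convert only targets logical, integer, numeric, complex and character (or factor) types, so a timestamp column stays character and needs an explicit step. Here is a rough sketch of that extra step. The extra columns val3 and ts, the "%Y-%m-%d %H:%M:%S" format and the UTC time zone are all assumptions made for illustration:

```r
# Hypothetical example data: every column arrives as character
raw <- data.frame(
  val1 = c("1", "2", "3", "4"),
  val2 = c("A", "B", "C", "D"),
  val3 = c("1.5", "2.5", NA, "4.0"),
  ts   = c("2020-01-01 10:00:00", "2020-01-02 11:30:00",
           "2020-01-03 12:45:00", "2020-01-04 13:15:00"),
  stringsAsFactors = FALSE
)

# Let type.convert handle the simple types column by column
converted <- raw
converted[] <- lapply(raw, type.convert, as.is = TRUE)

# ts is still character at this point, so parse it explicitly
# (format and time zone are assumptions about the incoming data)
converted$ts <- as.POSIXct(converted$ts,
                           format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# First class of each column, for a quick check of the result
col_classes <- vapply(converted, function(x) class(x)[1], character(1))
print(col_classes)
```

If the timestamp format is not known in advance, a guessing parser such as readr::parse_guess may be the more convenient route.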