
Change column class based on other dataframe


I have a data frame, dt, and I am trying to convert the class of each of its columns based on a second data frame, col_type.

See the example below for more detail.

> dt
  id <- c(1,2,3,4)
  a <- c(1,4,5,6)
  b <- as.character(c(0,1,1,4))
  c <- as.character(c(0,1,1,0))
  d <- c(0,1,1,0)
  dt <- data.frame(id,a,b,c,d, stringsAsFactors = FALSE)

> str(dt)
'data.frame':   4 obs. of  5 variables:
 $ id: num  1 2 3 4
 $ a : num  1 4 5 6
 $ b : chr  "0" "1" "1" "4"
 $ c : chr  "0" "1" "1" "0"
 $ d : num  0 1 1 0

Now I want to convert the class of each column according to the data frame below.

> var  
  var <- c("id","a","b","c","d")
  type <- c("character","numeric","numeric","integer","character")
  col_type <- data.frame(var,type, stringsAsFactors = FALSE)


> col_type
  var      type
1  id character
2   a   numeric
3   b   numeric
4   c   integer
5   d character

I want to convert id to the class given in the col_type data frame, and likewise for all the other columns.

My attempt:

library(data.table)
setDT(dt)
for(i in 1:ncol(dt)){
  if(colnames(dt)[i]%in%col_type$var){
    # Build and evaluate a call such as as.character(id).
    # Indexing col_type by i assumes its rows are in the same order as the columns of dt.
    dt[,col_type$var[i]:=eval(parse(text = paste0("as.",col_type$type[i],"(",col_type$var[i],")")))]
  }
}

Note: my solution works, but it is really slow, and I am wondering whether I can do this more efficiently and cleanly.

Suggestions would be appreciated.


Solution

  • I would read the data in with the colClasses argument derived from the col_type table:

    library(data.table)
    library(magrittr)
    setDT(col_type)
    
    res = capture.output(fwrite(dt)) %>% paste(collapse="\n") %>% 
      fread(colClasses = col_type[, setNames(type, var)])
    
    str(res)
    Classes ‘data.table’ and 'data.frame':  4 obs. of  5 variables:
     $ id: chr  "1" "2" "3" "4"
     $ a : num  1 4 5 6
     $ b : num  0 1 1 4
     $ c : int  0 1 1 0
     $ d : chr  "0" "1" "1" "0"
     - attr(*, ".internal.selfref")=<externalptr> 
    

    If you can do this when the data is read in initially, it simplifies to...

     res = fread("file.csv", colClasses = col_type[, setNames(type, var)])
    

    It's straightforward to do all of this without data.table.
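
    As a rough illustration of what a base R version might look like, here is a minimal sketch of the same round trip using write.csv and read.csv. The variable name csv_lines is introduced here for illustration, and quote = FALSE is an assumption that the data contains no commas or quotes.

    ```r
    # Example data from the question
    dt <- data.frame(
      id = c(1, 2, 3, 4),
      a  = c(1, 4, 5, 6),
      b  = as.character(c(0, 1, 1, 4)),
      c  = as.character(c(0, 1, 1, 0)),
      d  = c(0, 1, 1, 0),
      stringsAsFactors = FALSE
    )
    col_type <- data.frame(
      var  = c("id", "a", "b", "c", "d"),
      type = c("character", "numeric", "numeric", "integer", "character"),
      stringsAsFactors = FALSE
    )

    # Write dt as CSV text into a character vector (one element per line).
    # quote = FALSE keeps the fields unquoted; assumes no embedded commas or quotes.
    csv_lines <- capture.output(write.csv(dt, row.names = FALSE, quote = FALSE))

    # Read the text back, letting colClasses (named by column) set each class
    res <- read.csv(text = csv_lines,
                    colClasses = setNames(col_type$type, col_type$var))
    str(res)
    ```

    Because colClasses is named, the order of rows in col_type should not need to match the column order of dt.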


    If the data never passes through a text reader (for example, it is received as an RDS file), there's:

    setDT(dt)
    res = dt[, Map(as, .SD, col_type$type), .SDcols=col_type$var]
    
    str(res)
    Classes ‘data.table’ and 'data.frame':  4 obs. of  5 variables:
     $ id: chr  "1" "2" "3" "4"
     $ a : num  1 4 5 6
     $ b : num  0 1 1 4
     $ c : int  0 1 1 0
     $ d : chr  "0" "1" "1" "0"
     - attr(*, ".internal.selfref")=<externalptr> 
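
    The Map(as, ...) call above builds a new table. If updating dt in place is preferred, one possible variant is sketched below; the names cols and types are introduced here for illustration, and the match() lookup is an added assumption so that the row order of col_type need not mirror the column order of dt.

    ```r
    library(data.table)

    # Example data from the question
    dt <- data.table(
      id = c(1, 2, 3, 4),
      a  = c(1, 4, 5, 6),
      b  = as.character(c(0, 1, 1, 4)),
      c  = as.character(c(0, 1, 1, 0)),
      d  = c(0, 1, 1, 0)
    )
    col_type <- data.table(
      var  = c("id", "a", "b", "c", "d"),
      type = c("character", "numeric", "numeric", "integer", "character")
    )

    # Columns present in both tables, and the target type for each of them
    cols  <- intersect(names(dt), col_type$var)
    types <- col_type$type[match(cols, col_type$var)]

    # Replace the selected columns by reference with their converted versions
    dt[, (cols) := Map(as, .SD, types), .SDcols = cols]
    str(dt)
    ```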
    

    See showMethods("coerce") for the available conversions, as some might fail, e.g. as(letters[1:3], "factor").
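
    For target types where as() has no coerce method, one possible workaround is to look up the matching as.<type>() function by name. The sketch below uses a hypothetical helper, convert_to, and a small made-up table; neither is part of the original question or answer, and it assumes an as.<type>() function exists for every type listed.

    ```r
    library(data.table)

    # Small made-up example with a factor target
    dt <- data.table(x = letters[1:3], y = c("1", "2", "3"))
    col_type <- data.frame(
      var  = c("x", "y"),
      type = c("factor", "integer"),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: find as.<type>() by name and apply it to x
    convert_to <- function(x, type) match.fun(paste0("as.", type))(x)

    res <- dt[, Map(convert_to, .SD, col_type$type), .SDcols = col_type$var]
    str(res)
    ```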