Tags: r, data.table, fwrite, fread

Importing a csv file using fread loses factor order


When I read a previously saved CSV file with the data.table function fread, the order of the levels in my categorical data is not preserved. Instead, the levels end up in alphabetical order.

To replicate this issue, I have created a fake dataset using data.table

library(data.table)

dat <- data.table(name    = c("Joe", "Bob", "Steve", "Lucy", "Eric", "Marshall", "Henry"),
                  subject = as.factor(c(4, 1, 2, 3, 4, 3, 2)))

Using the setattr function, I then label the levels of the factor column named subject.

setattr(dat$subject,
    "levels",
    c("Math","Biology","Sport", "ICT"))

This is what the dataset looks like.

       name subject
1:      Joe     ICT
2:      Bob    Math
3:    Steve Biology
4:     Lucy   Sport
5:     Eric     ICT
6: Marshall   Sport
7:    Henry Biology

I examine the structure of the dataset and the order of the levels within the subject factor. The subject column is a factor, and its levels are in exactly the order I set them.

str(dat) 

   Classes ‘data.table’ and 'data.frame':   7 obs. of  2 variables:
 $ name   : chr  "Joe" "Bob" "Steve" "Lucy" ...
 $ subject: Factor w/ 4 levels "Math","Biology",..: 4 1 2 3 4 3 2
 - attr(*, ".internal.selfref")=<externalptr> 

as.ordered(dat$subject)

Levels: Math < Biology < Sport < ICT

When I save the dataset using fwrite and then read it back with fread, the subject column becomes a character column, and converting it to an ordered factor puts the levels in alphabetical order.

# save the data
fwrite(dat,
       file = "dat.csv",
       sep = "\t")

# read data
dat2 <- fread("dat.csv")

# check structure 
str(dat2)

Classes ‘data.table’ and 'data.frame':  7 obs. of  2 variables:
 $ name   : chr  "Joe" "Bob" "Steve" "Lucy" ...
 $ subject: chr  "ICT" "Math" "Biology" "Sport" ...
 - attr(*, ".internal.selfref")=<externalptr> 

# check order of the levels in subject
as.ordered(dat2$subject)

Levels: Biology < ICT < Math < Sport
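The alphabetical result above can be reproduced without data.table's reader at all. The sketch below is illustrative only: it uses a two-row table and a temporary file path (both are assumptions, not part of the original post) to show that the written file contains nothing but the labels, and that as.ordered on a plain character vector sorts the distinct values to build its levels.

```r
library(data.table)

# Small illustrative table; the factor levels are deliberately not alphabetical
dat_small <- data.table(
  name    = c("Joe", "Bob"),
  subject = factor(c("ICT", "Math"), levels = c("Math", "ICT"))
)

# Hypothetical temporary path, used only for this sketch
path <- tempfile(fileext = ".csv")
fwrite(dat_small, path, sep = "\t")

# The file holds a header and the labels; the level order is not stored anywhere
raw_lines <- readLines(path)
print(raw_lines)

# as.ordered() on a character vector sorts the unique values to create levels
sorted_levels <- levels(as.ordered(c("ICT", "Math", "Biology", "Sport")))
print(sorted_levels)
```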

The situation still persists when I use the colClasses argument and declare the subject column as a factor.
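The post does not show the colClasses call, so the following is only a guess at what it may have looked like. The dataset is rebuilt with an explicit levels argument for brevity, and the temporary file path is an assumption. The column does come back as a factor, but fread has to derive the levels from the strings in the file, so the custom order is not restored.

```r
library(data.table)

# Same data as in the question, built with an explicit level order
dat <- data.table(
  name    = c("Joe", "Bob", "Steve", "Lucy", "Eric", "Marshall", "Henry"),
  subject = factor(c("ICT", "Math", "Biology", "Sport", "ICT", "Sport", "Biology"),
                   levels = c("Math", "Biology", "Sport", "ICT"))
)

# Hypothetical temporary path, used only for this sketch
csv_path <- tempfile(fileext = ".csv")
fwrite(dat, csv_path, sep = "\t")

# Ask fread to return subject as a factor
dat3 <- fread(csv_path, colClasses = c(subject = "factor"))

is.factor(dat3$subject)   # the class is now factor
levels(dat3$subject)      # but the levels are derived from the file contents
```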

Question: Why does the fread (or fwrite) function in data.table not preserve the subject column as a factor? And when I control for this by using the colClasses argument to declare the subject column as a factor, why is the order of the levels in the subject column not preserved?


Solution

  • As @mt1022 said:

    This is expected behaviour, as you saved the factor column as character strings. When you read it again, fread or other data import functions have no idea of the original factor levels. If you want to preserve the attributes of the data, consider saving it as a .RDS file.
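A minimal sketch of the suggestion above, using base R's saveRDS and readRDS. The temporary file paths are assumptions for illustration. The second half is not from the answer: it shows one possible workaround if the data must stay in CSV, namely re-applying the known level order after reading.

```r
library(data.table)

subject_levels <- c("Math", "Biology", "Sport", "ICT")

dat <- data.table(
  name    = c("Joe", "Bob", "Steve", "Lucy", "Eric", "Marshall", "Henry"),
  subject = factor(c("ICT", "Math", "Biology", "Sport", "ICT", "Sport", "Biology"),
                   levels = subject_levels)
)

# Option 1: an .rds file stores the R object itself, attributes included
rds_path <- tempfile(fileext = ".rds")   # hypothetical path
saveRDS(dat, rds_path)
dat_rds <- readRDS(rds_path)
levels(dat_rds$subject)

# Option 2 (an addition, not from the answer): keep the CSV and
# restore the level order explicitly after reading
csv_path <- tempfile(fileext = ".csv")   # hypothetical path
fwrite(dat, csv_path, sep = "\t")
dat_csv <- fread(csv_path)
dat_csv[, subject := factor(subject, levels = subject_levels)]
levels(dat_csv$subject)
```

With option 2 the level order has to live somewhere outside the file, for example in the script that reads it, because the CSV itself only contains the labels.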