When I try to read a previously saved CSV file using the data.table function fread, the categorical order of my data is not preserved. The levels end up in alphabetical order.
To replicate this issue, I have created a fake dataset using data.table
library(data.table)
dat <- data.table(name = c("Joe", "Bob", "Steve", "Lucy", "Eric", "Marshall", "Henry"),
                  subject = as.factor(c(4, 1, 2, 3, 4, 3, 2)))
Using the setattr function, I then label the levels of the factor column named subject.
setattr(dat$subject,
"levels",
c("Math","Biology","Sport", "ICT"))
This is what the dataset looks like.
name subject
1: Joe ICT
2: Bob Math
3: Steve Biology
4: Lucy Sport
5: Eric ICT
6: Marshall Sport
7: Henry Biology
I examine the structure of the dataset and the order of the levels within the subject factor. The subject column is a factor and the levels are in the exact same order as I set them.
str(dat)
Classes ‘data.table’ and 'data.frame': 7 obs. of 2 variables:
$ name : chr "Joe" "Bob" "Steve" "Lucy" ...
$ subject: Factor w/ 4 levels "Math","Biology",..: 4 1 2 3 4 3 2
- attr(*, ".internal.selfref")=<externalptr>
as.ordered(dat$subject)
Levels: Math < Biology < Sport < ICT
When I save the dataset using fwrite and then use fread to open it, the subject column becomes a character column and the levels are ordered alphabetically.
# save the data
fwrite(dat,
file = "dat.csv",
sep = "\t")
# read data
dat2 <- fread("dat.csv")
# check structure
str(dat2)
Classes ‘data.table’ and 'data.frame': 7 obs. of 2 variables:
$ name : chr "Joe" "Bob" "Steve" "Lucy" ...
$ subject: chr "ICT" "Math" "Biology" "Sport" ...
- attr(*, ".internal.selfref")=<externalptr>
# check order of the levels in subject
as.ordered(dat2$subject)
Levels: Biology < ICT < Math < Sport
The situation still persists when I use the colClasses argument and declare the subject column as a factor.
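The exact colClasses call is not shown in the question, so the following is only a sketch of what such a call might look like. The named-vector form of colClasses, the temporary file path and the variable name dat3 are assumptions made for illustration.

```r
library(data.table)

# Rebuild the example data with the custom level order
dat <- data.table(
  name = c("Joe", "Bob", "Steve", "Lucy", "Eric", "Marshall", "Henry"),
  subject = factor(
    c("ICT", "Math", "Biology", "Sport", "ICT", "Sport", "Biology"),
    levels = c("Math", "Biology", "Sport", "ICT")
  )
)

# Write to a temporary tab-separated file (hypothetical path)
csv_path <- tempfile(fileext = ".csv")
fwrite(dat, file = csv_path, sep = "\t")

# Read it back, asking fread to treat subject as a factor
dat3 <- fread(csv_path, colClasses = c(subject = "factor"))

# subject is a factor again, but its levels are rebuilt from the
# strings in the file, not taken from the original level order
str(dat3)
levels(dat3$subject)
```

The file only contains the labels as text, so the original ordering of the levels has nowhere to come from at read time.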
Question
Why is the fread (or fwrite) function in data.table not preserving the subject column as a factor? And when this is controlled for by using the colClasses argument to specify the subject column as a factor, why is the hierarchical order of the levels in the subject column not preserved?
As @mt1022 said:
This is expected behaviour, as you saved the factor column as character strings. When you read it again, fread or other data import functions have no idea of the original factor levels. If you want to preserve the attributes of the data, consider saving it as a .RDS file.
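A minimal sketch of the two options this implies: saving as RDS, which keeps the factor attributes, or keeping the CSV and re-applying the level order after reading. The temporary file paths and the names subject_levels, dat_rds and dat_csv are assumptions made for illustration, and the second option assumes the intended level order is known at read time.

```r
library(data.table)

# The level order we want to keep
subject_levels <- c("Math", "Biology", "Sport", "ICT")

dat <- data.table(
  name = c("Joe", "Bob", "Steve", "Lucy", "Eric", "Marshall", "Henry"),
  subject = factor(
    c("ICT", "Math", "Biology", "Sport", "ICT", "Sport", "Biology"),
    levels = subject_levels
  )
)

# Option 1: RDS stores the R object itself, including factor levels
rds_path <- tempfile(fileext = ".rds")
saveRDS(dat, rds_path)
dat_rds <- readRDS(rds_path)
levels(dat_rds$subject)

# Option 2: keep the CSV, then rebuild the factor with explicit levels
csv_path <- tempfile(fileext = ".csv")
fwrite(dat, file = csv_path, sep = "\t")
dat_csv <- fread(csv_path)
dat_csv[, subject := factor(subject, levels = subject_levels)]
levels(dat_csv$subject)
```

The RDS route needs no extra bookkeeping but is R-specific; the CSV route stays portable at the cost of storing the level order somewhere in your code.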