I have been asked to write a function that takes a path to a directory, reads a number of .csv files, and returns a data.frame with the number of complete cases for each file, in the form:
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
## 4 10 148
## 5 12 96
I have the following solution (function signature is given):
complete <- function(directory, id = 1:332) {
  myFiles <- list.files(path = directory, pattern = ".csv", recursive = TRUE, full.names = TRUE)
  data <- lapply(myFiles[id], read.csv)
  frame <- do.call("rbind", data)
  frame <- frame[complete.cases(frame), ]
  frame$ID <- factor(frame$ID, ordered = TRUE)
  by <- by(frame, frame$ID, nrow, simplify = FALSE)
  complete <- data.frame(id = names(by), nobs = unlist(by))
  return(complete)
}
That gives me the correct output, except in one situation. If the function call is something like complete(directory, 30:25), the order of the id column of the data.frame is expected to be preserved (here 30, 29, etc.). But that fails because by sorts the output list by factor level. Is there a better solution for my problem (using standard packages)? Or can I inhibit the ordering?
I don't think the ordered= parameter is doing what you think it is. When you set ordered=T, it creates an ordered factor, which is analogous to an ordinal variable, whereas a regular factor behaves more like a categorical variable. It does not assume the vector is already ordered, nor does it affect the sorting of the vector in any way. Either way, the levels default to the sorted unique values.
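A small sketch may make this concrete. The vector ids below is made up for illustration and simply stands in for frame$ID; it compares a plain factor with an ordered one.

```r
# Toy vector standing in for frame$ID (values are made up for illustration)
ids <- c(30, 30, 29, 25, 29)

f_plain   <- factor(ids)
f_ordered <- factor(ids, ordered = TRUE)

# Both factors get the same, sorted levels: "25" "29" "30"
levels(f_plain)
levels(f_ordered)

# ordered = TRUE only marks the levels as ranked, so that
# comparisons such as f_ordered < "30" become meaningful
is.ordered(f_plain)    # FALSE
is.ordered(f_ordered)  # TRUE
```

In both cases the levels come out sorted, regardless of the order in which the values appear in ids.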
If you want to specify a given order, you must set the levels explicitly:
frame$ID <- factor(frame$ID, levels=unique(frame$ID))
and then by should behave as expected, because it returns its results in the order of the factor levels.
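As a rough check, here is a minimal sketch on toy data. The data.frame below is invented and only stands in for the combined, complete-case frame from the question; the names counts and result are likewise just illustrative.

```r
# Toy data standing in for the combined, complete-case frame
frame <- data.frame(ID = c(30, 30, 29, 25, 29, 30), value = 1:6)

# Levels follow the order of first appearance (30, 29, 25)
# instead of the default sorted order (25, 29, 30)
frame$ID <- factor(frame$ID, levels = unique(frame$ID))

# by() returns one entry per level, in level order
counts <- by(frame, frame$ID, nrow, simplify = FALSE)
result <- data.frame(id = names(counts), nobs = unlist(counts))

result$id    # ids in the order 30, 29, 25
result$nobs  # matching row counts 3, 2, 1
```

This relies on the files being read in the order given by id, so that unique(frame$ID) reflects the requested order.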