I have a big dataset that I want to divide into n sub-datasets, each of equal size s; the last sub-dataset may be smaller than the others if the total is not divisible by s. I then want to write each sub-dataset as a CSV file to the working directory.
Let's take the following small example:
set.seed(1234)
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
mydf
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 3 7 1 9 6 4 7 5 8 2 2 2 8
2 5 3 4 6 9 5 3 10 5 8 10 2 10
3 4 6 10 4 4 6 3 4 2 9 9 2 9
4 10 10 9 4 3 7 7 7 10 6 7 10 2
5 10 3 9 3 2 10 9 6 4 4 4 6 3
6 7 2 8 7 5 5 10 10 9 3 7 8 4
7 3 2 2 7 10 9 2 2 10 1 1 10 4
8 3 9 9 7 3 1 7 6 10 3 10 3 2
9 9 3 6 9 3 2 2 3 4 2 9 10 10
10 6 4 3 3 5 9 3 9 10 7 4 6 10
I want to create a function that randomly splits the dataset into n subsets (here of size 3: with 13 columns, four subsets get 3 columns each and the last gets the remaining 1) and writes each subset out as a separate file.
Here is what I did:
set.seed(123)
reshuffled <- sample(1:length(mydf), length(mydf), replace = FALSE)
# just crazy manual divide
group1 <- reshuffled[1:3]; group2 <- reshuffled[4:6]; group3 <- reshuffled[7:9]
group4 <- reshuffled[10:12]; group5 <- reshuffled[13]
# just manual
data1 <- mydf[, group1]; data2 <- mydf[, group2]; ... and so on
# I want to write the dimensions of each dataset in its first row
cat(dim(data1))
write.csv(data1, "data1.csv"); write.csv(data2, "data2.csv"); ... and so on
Is it possible to loop this process, as I have to generate 100 sub-datasets?
Maybe there is a cleaner and simpler solution, but you can try the following:
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
## Number of columns for each sub-dataset
size <- 3
nb.cols <- ncol(mydf)
nb.groups <- nb.cols %/% size                 # number of full-size groups
reshuffled <- sample.int(nb.cols)             # random permutation of column indices
## group labels: 'size' copies of each full group, remainder in one extra group
groups <- c(rep(1:nb.groups, each = size), rep(nb.groups + 1, nb.cols %% size))
dfs <- lapply(split(reshuffled, groups), function(v) mydf[, v, drop = FALSE])
for (i in seq_along(dfs)) write.csv(dfs[[i]], file = paste0("data", i, ".csv"))
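You also asked about writing the dimensions of each dataset in its first row. `write.csv` has no option for that, but one possible sketch is to `cat` the dimensions first and then append the data with `write.table` (the helper name `write_with_dim` below is my own invention, not part of any package):

```r
## Hypothetical helper: writes "nrow ncol" as the first line of the file,
## then appends the data frame itself as comma-separated values.
write_with_dim <- function(df, file) {
  cat(dim(df), "\n", file = file)            # first line, e.g. "10 3"
  suppressWarnings(                          # silence "appending column names" warning
    write.table(df, file = file, sep = ",",
                append = TRUE, row.names = FALSE)
  )
}

set.seed(123)
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
size <- 3
nb.cols <- ncol(mydf)
nb.groups <- nb.cols %/% size
reshuffled <- sample.int(nb.cols)
groups <- c(rep(1:nb.groups, each = size), rep(nb.groups + 1, nb.cols %% size))
dfs <- lapply(split(reshuffled, groups), function(v) mydf[, v, drop = FALSE])
for (i in seq_along(dfs)) write_with_dim(dfs[[i]], paste0("data", i, ".csv"))
```

A reader of the files can then `readLines(file, n = 1)` to recover the dimensions before parsing the rest.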