Split a dataset into multiple datasets with random columns in R


I have a big dataset. I want to divide it into "n" sub-datasets, each of equal size "s". The last dataset may be smaller than the others if the total is not divisible by "s". I then want to write each sub-dataset as a CSV file to the working directory.

Let's take the following small example:

set.seed(1234)
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
mydf

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1   3  7  1  9  6  4  7  5  8   2   2   2   8
2   5  3  4  6  9  5  3 10  5   8  10   2  10
3   4  6 10  4  4  6  3  4  2   9   9   2   9
4  10 10  9  4  3  7  7  7 10   6   7  10   2
5  10  3  9  3  2 10  9  6  4   4   4   6   3
6   7  2  8  7  5  5 10 10  9   3   7   8   4
7   3  2  2  7 10  9  2  2 10   1   1  10   4
8   3  9  9  7  3  1  7  6 10   3  10   3   2
9   9  3  6  9  3  2  2  3  4   2   9  10  10
10  6  4  3  3  5  9  3  9 10   7   4   6  10

I want to create a function that randomly splits the columns of the dataset into n subsets (in this case, say, of size 3; as there are 13 columns, the last dataset will have 1 column and the other 4 will each have 3) and writes each subset out as a separate text file.

Here is what I did:

set.seed(123)
reshuffled <- sample(1:length(mydf),length(mydf), replace = FALSE)
# just crazy manual divide 
group1 <- reshuffled[1:3]; group2 <- reshuffled[4:6]; group3 <- reshuffled[7:9]
group4 <- reshuffled[10:12]; group5 <-  reshuffled[13]

# just manual 
data1 <- mydf[,group1]; data2 <- mydf[,group2]  # ... and so on
# I want to write the dimensions of the dataset in the first row of each dataset
cat (dim(data1))
write.csv(data1, "data1.csv");  write.csv(data2, "data2.csv")  # ... and so on

Is it possible to loop this process? I have to generate 100 sub-datasets.


Solution

  • Maybe there is a cleaner and simpler solution, but you can try the following:

    mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
    
    ## Number of columns for each sub-dataset
    size <- 3
    
    nb.cols <- ncol(mydf)
    nb.groups <- nb.cols %/% size
    reshuffled <- sample.int(nb.cols, replace=FALSE)
    groups <- c(rep(1:nb.groups, each=size), rep(nb.groups+1, nb.cols %% size))
    dfs <- lapply(split(reshuffled, groups), function(v) mydf[,v,drop=FALSE])
    
    for (i in seq_along(dfs)) write.csv(dfs[[i]], file = paste0("data", i, ".csv"))
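
The question also asks for a reusable function and for the dimensions of each sub-dataset on the first line of its file, which the loop above does not cover. The following is a minimal sketch of one way to do both, not a definitive implementation. The function name `split_columns_to_csv` and its arguments `dir` and `prefix` are hypothetical, and the grouping uses `ceiling(seq_along(...) / size)` as a swapped-in alternative to the `rep()` construction above, so it also copes with a `size` larger than the number of columns.

```r
# Hypothetical helper (not part of the original answer): randomly split the
# columns of `df` into groups of `size` columns and write one CSV per group.
split_columns_to_csv <- function(df, size, dir = getwd(), prefix = "data") {
  stopifnot(is.data.frame(df), size >= 1)

  # Shuffle the column indices without replacement
  reshuffled <- sample.int(ncol(df))
  # Group id for each shuffled column: 1,1,1,2,2,2,...; the last group may be short
  groups <- ceiling(seq_along(reshuffled) / size)
  dfs <- lapply(split(reshuffled, groups), function(v) df[, v, drop = FALSE])

  files <- file.path(dir, paste0(prefix, seq_along(dfs), ".csv"))
  for (i in seq_along(dfs)) {
    con <- file(files[i], open = "w")
    # First line: dimensions of this sub-dataset as "rows,columns"
    writeLines(paste(dim(dfs[[i]]), collapse = ","), con)
    # The connection is already open, so the data follows the dimension line
    write.csv(dfs[[i]], con)
    close(con)
  }
  invisible(files)
}

# Usage with the example data; tempdir() is used here to avoid cluttering
# the working directory
set.seed(1234)
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
files <- split_columns_to_csv(mydf, size = 3, dir = tempdir())
```

Because of the extra first line, such a file would have to be read back with something like `read.csv(files[1], skip = 1, row.names = 1)`.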