I am calculating the correlation between two data sets, but because the data is large (10 GB) and my RAM is only 6 GB, I am running into a memory issue. How can I chunk my code?
dir1 <- list.files("D:sdr", "*.bin", full.names = TRUE)
dir2 <- list.files("D:dsa", "*.img", full.names = TRUE)
file_tot <- array(dim = c(1440, 720, 664, 2))
for (i in 1:length(dir1)) {
  file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * 720, signed = TRUE)
  file_tot[, , i, 2] <- readBin(dir2[i], integer(), size = 2, n = 1440 * 720, signed = FALSE)
  file_tot[, , i, 2] <- file_tot[, , i, 2] * 0.000030518594759971
  file_tot[, , i, 2][file_tot[, , i, 2] == 9999] <- NA
}
result <- apply(file_tot, c(1, 2), function(x) cor(x[, 1], x[, 2]))
But I got this error:
Error: cannot allocate vector of size 10.3 Gb
In addition: Warning messages:
1: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
2: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
3: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
4: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
If you are only calculating this correlation, you don't really need to switch to packages such as ff or bigmemory; you can just process your files in chunks. The error message is consistent with the sizes involved: the full array holds 1440 × 720 × 664 × 2 doubles, which at 8 bytes each comes to roughly 10.3 GB, exactly the vector R fails to allocate. If you are planning to do more analyses, using one of the big data packages might be useful.
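If you do want to go the big-data route at some point, a minimal sketch with bigmemory could look like the following; the backing-file names and the one-column-per-file layout here are my assumptions, not part of the question:

library(bigmemory)

# One file-backed matrix per data set: one row per grid cell, one
# column per file; the data lives on disk, not in RAM
bm1 <- filebacked.big.matrix(nrow = 1440 * 720, ncol = 664,
                             type = "double",
                             backingfile = "set1.bk",       # assumed name
                             descriptorfile = "set1.desc")  # assumed name
for (i in seq_along(dir1)) {
  bm1[, i] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * 720)
}
# Build a second matrix for dir2 the same way, then correlate row
# chunks of the two matrices instead of holding whole arrays in memory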
Here is an example of how you might process your files chunkwise:
# Generate some test data; in this case I only use 7 columns,
# but it should scale to any number of columns (except perhaps
# generating the files)
dim <- c(1440, 7, 664, 2)
# For the data in the question, replace the previous line with:
# dim <- c(1440, 720, 664, 2)
for (i in seq_len(dim[3])) {
  dat <- rnorm(dim[1] * dim[2])
  writeBin(dat, paste0("file", i, ".bin"), size = 4)
  dat <- rnorm(dim[1] * dim[2])
  writeBin(dat, paste0("file", i, ".img"), size = 4)
}
dir1 <- list.files("./", "*.bin", full.names = TRUE)
dir2 <- list.files("./", "*.img", full.names = TRUE)
result <- array(dim = c(dim[1], dim[2]))
file_tot <- array(dim = c(dim[1], dim[3], dim[4]))
# Process the files column by column; column j of a file starts at
# byte offset (j - 1) * dim[1] * (bytes per value), because the
# values were written in column-major order
for (j in seq_len(dim[2])) {
  for (i in seq_along(dir1)) {
    # Open the first file
    con <- file(dir1[i], 'rb')
    # Skip to the start of column j (4-byte values)
    seek(con, (j - 1) * dim[1] * 4)
    # Read the column
    file_tot[, i, 1] <- readBin(con, numeric(), size = 4, n = dim[1])
    close(con)
    # And repeat for the second file
    con <- file(dir2[i], 'rb')
    seek(con, (j - 1) * dim[1] * 4)
    file_tot[, i, 2] <- readBin(con, numeric(), size = 4, n = dim[1])
    # For the data sets in the question, the previous two lines should be
    # replaced by the next four; note that the 2-byte values also change
    # the seek offset, and that the 9999 check is best done before the
    # scaling, since the scaled values never get anywhere near 9999:
    # seek(con, (j - 1) * dim[1] * 2)
    # file_tot[, i, 2] <- readBin(con, integer(), size = 2, n = dim[1], signed = FALSE)
    # file_tot[, i, 2][file_tot[, i, 2] == 9999] <- NA
    # file_tot[, i, 2] <- file_tot[, i, 2] * 0.000030518594759971
    close(con)
  }
  # With NA values present, pass use = "complete.obs" (or similar) to cor()
  result[, j] <- apply(file_tot, 1, function(x) cor(x[, 1], x[, 2]))
}
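Each pass now only holds one column of every file in memory: an array of 1440 × 664 × 2 doubles, about 15 MB, so the full 10 GB data set stays well within your 6 GB of RAM.

On the small generated example you can also sanity-check the chunked result against a plain in-memory computation; this check is only feasible for the 7-column test data, of course:

# Read the test files whole and correlate directly
full <- array(dim = dim)
for (i in seq_len(dim[3])) {
  full[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = dim[1] * dim[2])
  full[, , i, 2] <- readBin(dir2[i], numeric(), size = 4, n = dim[1] * dim[2])
}
check <- apply(full, c(1, 2), function(x) cor(x[, 1], x[, 2]))
all.equal(result, check)  # should be TRUE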