Tags: r, loops, bigdata, chunks

Partition a large file into small files in R


I need to break a large file (14 gigabytes) into smaller files. It is a text file, the separator is ";", and I know it has 70 columns (string, double). I would like to read it 1 million lines at a time and save the pieces in different files: file1, file2, ..., fileN.

After help from @MKR I arrived at the code below, but the process was very slow. I also tried using fread, but that did not work either.

How can I optimize this code?

NEW CODE

# db holds the path to the 14 GB ";"-separated file
chunkSize <- 500000
conex <- file(description = db, open = "r")

# first chunk: read with the header so the column names can be reused below
dataChunk <- read.table(conex, nrows = chunkSize, header = TRUE, fill = TRUE, sep = ";")
db_colnames <- colnames(dataChunk)

index <- 0
counter <- 0
total <- 0

repeat {
  total <- total + sum(dataChunk$total)
  counter <- counter + nrow(dataChunk)
  write.table(dataChunk, file = paste0("MY_FILE_new", index), sep = ";", row.names = FALSE)

  # a chunk shorter than chunkSize means the end of the file was reached
  if (nrow(dataChunk) < chunkSize) {
    print("linesok")
    break
  }

  index <- index + 1
  print(paste("lines", index * chunkSize))

  # read the next chunk without a header, reusing the captured column names
  dataChunk <- read.table(conex, nrows = chunkSize, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)
}
close(conex)

Solution

  • You are on the right track to a solution.

    The approach should be:

    1. Read 1 million lines.
    2. Write them to a new file.
    3. Read the next 1 million lines.
    4. Write them to another new file.
    

    Let's convert the above logic into a loop, along the lines of the OP's attempt:

    # 'con' is assumed to be an open connection to the large file and
    # 'db_colnames' a character vector holding its 70 column names, e.g.
    # con <- file(description = db, open = "r")

    index <- 0
    counter <- 0
    total <- 0
    chunks <- 500000
    
    repeat{
      dataChunk <- read.table(con, nrows=chunks, header=FALSE, fill = TRUE,
                              sep=";", col.names=db_colnames)
    
      # do processing on dataChunk (e.g. adding a header, converting data types)
    
      # Create a new file name and write to it. You can have your own logic for file names
      write.table(dataChunk, file = paste0("file",index), sep=";", row.names = FALSE)
    
      # check whether the end of the file has been reached and break out of repeat
      if(nrow(dataChunk) < chunks){
        break
      }
    
      # increment the index to read the next chunk
      index = index+1
    
    }
    close(con)
    
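    A large part of the slowness the OP reports typically comes from read.table re-inferring the type of all 70 columns on every chunk. Below is a minimal sketch of pinning the types with colClasses; the split between string and double columns is an assumption for illustration only:

    # hypothetical layout: adjust the mix of "character" and "numeric"
    # to the real 70-column structure of the file
    col_types <- c(rep("character", 10), rep("numeric", 60))

    dataChunk <- read.table(con, nrows = chunks, header = FALSE, fill = TRUE,
                            sep = ";", col.names = db_colnames,
                            colClasses = col_types, stringsAsFactors = FALSE)

    With colClasses fixed, read.table no longer has to guess a type for every column of every chunk, which is usually the largest single speed-up for this kind of loop.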

    Edit: added another option that reads the file with data.table::fread, as requested by the OP.

    library(data.table)
    
    index <- 0
    counter <- 0
    total <- 0
    chunks <- 1000000
    fileName <- "myfile"
    
    repeat{
      # With fread the file is re-opened and scanned from the start on each iteration;
      # skip is offset by 1 so the header line of the original file is not read as data
      dataChunk <- fread(input = fileName, nrows=chunks, header=FALSE, fill = TRUE,
                         skip = chunks*index + 1, sep=";", col.names=db_colnames)
    
      # do processing on dataChunk (e.g. adding a header, converting data types)
    
      # Create a new file name and write to it. You can have your own logic for file names
      write.table(dataChunk, file = paste0("file",index), sep=";", row.names = FALSE)
    
      # check whether the end of the file has been reached and break out of repeat
      if(nrow(dataChunk) < chunks){
        break
      }
    
      # increment the index to read the next chunk
      index = index+1
    
    }
    

    Note: the code above is only a partial, pseudo-code snippet meant to help the OP; it will not run and produce results on its own.
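
    A further caveat not covered above: fread with skip has to scan the file from the beginning on every iteration, so on a 14 GB input the loop gets slower with each chunk, and write.table is also comparatively slow for large chunks (data.table::fwrite is a faster alternative). If the goal is only to split the file rather than parse it, a hedged alternative is to stream raw lines from a single open connection; the file name and chunk size below are assumptions:

    in_file   <- "MY_FILE.txt"        # assumed path to the 14 GB file
    chunkSize <- 1000000

    con    <- file(in_file, open = "r")
    header <- readLines(con, n = 1)   # keep the header so every piece gets a copy
    index  <- 0

    repeat {
      lines <- readLines(con, n = chunkSize)
      if (length(lines) == 0) break

      # write the header plus the next chunk of raw lines, untouched
      out <- file(paste0("file", index, ".txt"), open = "w")
      writeLines(c(header, lines), out)
      close(out)

      index <- index + 1
    }
    close(con)

    Because no parsing or type conversion happens, this variant is usually limited only by disk throughput.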