Tags: r, memory, binary, large-files

Raw (binary) data too big to write to disk: how do I write it chunk-wise (appending)?


I have a large raw vector in R (i.e. array of binary data) that I want to write to disk, but I'm getting an error telling me the vector is too large. Here's a reproducible example and the error I get:

> writeBin(raw(1024 * 1024 * 1024 * 2), "test.bin")

Error in writeBin(raw(1024 * 1024 * 1024 * 2), "test.bin") : 
  long vectors not supported yet: connections.c:4147

I've noticed that this is tied to R's vector-length limit rather than to the file size itself: 1024 * 1024 * 1024 * 2 is 2^31 bytes, one element past the 2^31 - 1 maximum for ordinary vectors, and writeBin() cannot yet send such "long vectors" through a connection. If I try to write a single byte less (1024 * 1024 * 1024 * 2 - 1), it works just fine.
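A quick console check makes the boundary obvious (a small illustration, not part of my original test):

.Machine$integer.max      # 2147483647, i.e. 2^31 - 1: the ordinary vector length limit
1024 * 1024 * 1024 * 2    # 2147483648: one element more, which makes a "long vector"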

I was thinking about a workaround where I write the large vector to disk in batches, appending each chunk of binary data to the file, like this:

large_file = raw(1024 * 1024 * 1024 * 2)
chunk_size = 1024 * 1024 * 512
n_chunks = ceiling(length(large_file) / chunk_size)

for (i in 1:n_chunks)
{
  start_byte = ((i - 1) * chunk_size) + 1
  end_byte = start_byte + chunk_size - 1
  if (i == n_chunks)
    end_byte = length(large_file)
  this_chunk = large_file[start_byte:end_byte]
  appendBin(this_chunk, "test.bin") # <-- non-existing magical formula!
}

But I can't find any function like the appendBin() I invented above, nor any R documentation describing how to append data straight to a file on disk.

So my question boils down to this: does anyone know how to append raw (binary) data to a file already on disk, without having to read the whole existing file into memory first?

Extra details: I'm currently using 64-bit R 3.4.2 on a Windows 10 PC with 192 GB of RAM. I tried another PC (64-bit R 3.5, Windows 8, 8 GB of RAM) and hit exactly the same problem.

Any kind of insight or workaround would be greatly appreciated!!!

Thank you!


Solution

  • Thanks to @MichaelChirico and @user2554330, I was able to figure out a workaround. Essentially, I just need to open the file as a new connection in "a+b" (append, binary) mode and feed that connection into writeBin().

    Here's a copy of the working code.

    large_file = raw(1024 * 1024 * 1024 * 3)   # 3 GB of zeroed bytes
    chunk_size = 1024 * 1024 * 512             # write 512 MB at a time
    n_chunks = ceiling(length(large_file) / chunk_size)
    
    # Start from a clean file, since "a+b" appends to whatever already exists
    if (file.exists("test.bin"))
      file.remove("test.bin")
    
    for (i in 1:n_chunks)
    {
      start_byte = ((i - 1) * chunk_size) + 1
      end_byte = start_byte + chunk_size - 1
      if (i == n_chunks)                       # the last chunk may be shorter
        end_byte = length(large_file)
      this_chunk = large_file[start_byte:end_byte]
      # Open in append/binary mode so each chunk lands after the previous one
      output_file = file(description = "test.bin", open = "a+b")
      writeBin(this_chunk, output_file)
      close(output_file)
    }
    

    I know it's ugly that I'm opening and closing the file multiple times, but that kept the error from popping up with even bigger files.
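    For completeness, here's a sketch of a single-connection variant (reusing the variables from above): successive writeBin() calls on an open connection write one after another, so the file never needs reopening. I'm sticking with the open/close version, though, since that's what kept the error from coming back on the really big files.

    output_file = file(description = "test.bin", open = "wb")   # open once in binary write mode
    for (i in 1:n_chunks)
    {
      start_byte = ((i - 1) * chunk_size) + 1
      end_byte = min(start_byte + chunk_size - 1, length(large_file))
      writeBin(large_file[start_byte:end_byte], output_file)    # continues at the current file position
    }
    close(output_file)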

    Thanks again for the insights, guys! =)