Tags: binary-files, crystal-lang

How to continuously read a binary file in Crystal and get Bytes out of it?


Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?

Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:

file = File.open "something.bin", "rb"

The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):

data = Bytes.new(0)

3.times do
    bytes_to_read = file.read_byte.not_nil!
    chunk = Bytes.new(bytes_to_read)
    file.read(chunk)
    data += chunk
end

The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:

data = [] of UInt8

3.times do
    bytes_to_read = file.read_byte.not_nil!
    chunk = Bytes.new(bytes_to_read)
    file.read(chunk)
    data += chunk.to_a
end

However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.

So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?


Solution

  • One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
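    The collect-and-merge alternative could be sketched like this (an illustration, not code from the original answer; it assumes `file` is the already-opened file from the question, and note that each chunk is copied twice, once on read and once on merge):

    ```crystal
    # Collect each chunk as its own Bytes, then merge once at the end.
    chunks = [] of Bytes
    3.times do
      bytes_to_read = file.read_byte.not_nil!
      chunk = Bytes.new(bytes_to_read)
      file.read_fully(chunk)
      chunks << chunk
    end

    # Allocate one buffer of the total size and copy the chunks into it.
    data = Bytes.new(chunks.sum(&.size))
    offset = 0
    chunks.each do |chunk|
      chunk.copy_to(data + offset)
      offset += chunk.size
    end
    ```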

    The best solution would probably be to use a buffer that is large enough to fit all the data that could possibly be read, or at least is very likely to be (resizing if necessary). If the maximum size is just 3 * 255 bytes, this is a no-brainer. You can trim the buffer down at the end if it turns out to be too large.

    data = Bytes.new 3 * UInt8::MAX
    bytes_read = 0
    3.times do
      bytes_to_read = file.read_byte.not_nil!
      # read exactly bytes_to_read bytes into the buffer at the current offset
      file.read_fully(data[bytes_read, bytes_to_read])
      bytes_read += bytes_to_read
    end
    # trim to the actual size at the end:
    data = data[0, bytes_read]
    

    Note: Since the data format tells you how many bytes to read, you should use read_fully instead of read, which would silently return fewer bytes if less data happens to be available.


    EDIT: Since the number of chunks, and thus the maximum size, is not known in advance (per comment), you should use a dynamically growing buffer. This is easy with IO::Memory, which takes care of resizing its internal buffer as needed.

    io = IO::Memory.new
    loop do
      bytes_to_read = file.read_byte
      break if bytes_to_read.nil?
      IO.copy(file, io, bytes_to_read)
    end
    data = io.to_slice
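    For reference, here is a small end-to-end check of that loop ("chunks.bin" and its contents are made up for the example; the file format is the same length-prefixed one from the question):

    ```crystal
    # Write two length-prefixed chunks: [3]"abc" then [2]"de".
    File.open("chunks.bin", "wb") do |f|
      f.write_byte 3_u8
      f.write "abc".to_slice
      f.write_byte 2_u8
      f.write "de".to_slice
    end

    # Read them back with the IO::Memory loop.
    data = File.open("chunks.bin", "rb") do |file|
      io = IO::Memory.new
      loop do
        bytes_to_read = file.read_byte
        break if bytes_to_read.nil?
        IO.copy(file, io, bytes_to_read)
      end
      io.to_slice
    end

    String.new(data) # => "abcde"
    ```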