Tags: c#, io, binary-files

How to read/write random chunks from large binary files properly?


I am writing a library for working with binary files, specifically "The Log Information Standard (LIS) 79 Subset", which has various types of records with entries of various datatypes. Each entry may be a single value, an array, or have a more complex structure. File size may vary from 3-5 KB to several GB.

Goal: read and modify any part of the file of any size.

What has been tried:

  • The first implementation simply read the whole file and stored every record and datatype in an appropriate class instance, along with its buffer. That was perfectly fine for small files, but extremely slow and RAM-hungry once the file grew beyond 1 GB.
  • Then I tried to stop using buffers completely: read the file once and store the offset, size, and type of each component. This approach worked well for reading, but as I understood after some research, there is no way to insert data at an arbitrary position in a file.

So the question is: how do I properly handle such data without overusing RAM?


Solution

  • I have no knowledge of LIS files, so this answer will be about binary files in general.

    Many binary file formats have some kind of index in addition to the actual data entries. Reading the file then consists of scanning through the index until you find what you are looking for, and jumping to the offset it specifies. The index might be stored in a chunk at the beginning of the file, or as a linked list spread throughout the file. The actual format will likely be much more complicated, but this is a useful simplified mental model.

    If you know the format you can simply use a BinaryReader to read values and jump around in the file accordingly, probably with some kind of state machine to keep track of what you are currently reading.
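
    For example, a minimal sketch of that pattern might look like the following. The record layout here (a count followed by offset/length pairs) is made up for illustration; a real LIS reader would follow the actual spec.

    using System.IO;
    
    // Hypothetical layout: [int32 count] [count x (int64 offset, int32 length)] [data...]
    using var fs = File.OpenRead("data.bin"); // placeholder file name
    var br = new BinaryReader(fs);
    
    var count = br.ReadInt32();
    var entries = new (long Offset, int Length)[count];
    for (int i = 0; i < count; i++)
        entries[i] = (br.ReadInt64(), br.ReadInt32());
    
    // Jump straight to one entry; only its bytes are read from disk.
    // (Assumes the file contains at least three entries.)
    fs.Position = entries[2].Offset;
    var payload = br.ReadBytes(entries[2].Length);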

    insert data at an arbitrary position in a file

    This is really difficult to do well. You have to choose between wasting space, moving data around, and fragmentation. Databases spend a lot of effort trying to find a happy medium between these extremes.

    But if you are working with an existing format, that choice has been made for you. If the format is not designed for cheap insertions, you will likely need to move the vast majority of the data in the file, essentially requiring you to rewrite the entire thing (see the sketch below). If you are lucky, the format might allow appending data cheaply.
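
    To make that cost concrete, here is a hedged sketch of an insert-by-rewrite: everything after the insertion point is streamed into a new file, so the cost is proportional to the file size. The helper and file names are made up for illustration.

    using System;
    using System.IO;
    
    // Insert `insert` at `position` by rewriting: copy the unchanged head,
    // the new bytes, then the shifted tail into a temp file, then swap it in.
    static void InsertByRewrite(string path, long position, byte[] insert)
    {
        var tmp = path + ".tmp";
        using (var src = File.OpenRead(path))
        using (var dst = File.Create(tmp))
        {
            CopyBytes(src, dst, position);        // head, unchanged
            dst.Write(insert, 0, insert.Length);  // the new bytes
            src.CopyTo(dst);                      // tail, shifted
        }
        File.Delete(path);
        File.Move(tmp, path);
    }
    
    // Copies exactly `count` bytes from the current position of `src`.
    static void CopyBytes(Stream src, Stream dst, long count)
    {
        var buffer = new byte[81920];
        while (count > 0)
        {
            var read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read == 0) throw new EndOfStreamException();
            dst.Write(buffer, 0, read);
            count -= read;
        }
    }
    
    InsertByRewrite("records.bin", 128, new byte[] { 1, 2, 3 }); // placeholder values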

    If the format is not designed for cheap modification, you will most likely need to convert it to some representation that is cheap to modify. If you can keep it all in memory, that will simplify things considerably.

    You could also parse just the index into an in-memory structure, and keep any updates in memory until it is time to write the data back to disk. An imaginary format could look something like the code below. The key point is that you read only the minimum amount of data needed from disk, while additions and modifications are done in memory. Note that this is for illustrative purposes only.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    
    public class Index
    {
        private readonly Dictionary<string, IEntry> entries = new();
    
        public IEnumerable<string> List => entries.Keys;
        public byte[] Read(string key) => entries[key].Read();
        public void UpdateOrAdd(string key, byte[] data) => entries[key] = new MemoryEntry(data);
    
        public static Index Load(Stream source)
        {
            var br = new BinaryReader(source);
            var numEntries = br.ReadInt32();
            var result = new Index();
            
            for (int i = 0; i < numEntries; i++)
            {
                var key = br.ReadString();
                var length = br.ReadInt32();
    
                // Note. Mixing index information and data like this will make it 
                // easy to read/append, but slower to load. 
                var offset = (int)br.BaseStream.Position;
                result.entries[key] = new FileEntry(source, offset, length);
                br.BaseStream.Position += length;
            }
            return result;
        }
    
        public void Save(Stream destination)
        {
            var bw = new BinaryWriter(destination);
            bw.Write(entries.Count);
            foreach (var (key, value) in entries)
            {
                bw.Write(key);
                bw.Write(value.Length);
                value.CopyTo(bw.BaseStream);
            }
        }
    }
    
    public interface IEntry
    {
        public void CopyTo(Stream destination);
        public byte[] Read();
        public int Length { get; }
    }
    
    public class MemoryEntry : IEntry
    {
        private readonly byte[] data;
        public MemoryEntry(byte[] data) => this.data = data;
        public void CopyTo(Stream destination) => destination.Write(data, 0, data.Length);
        public byte[] Read() => data;
        public int Length => data.Length;
    }
    
    public class FileEntry : IEntry
    {
        private readonly Stream fileStream;
        private readonly int offset;
        private readonly int length;
    
        public FileEntry(Stream fileStream, int offset, int length)
        {
            this.fileStream = fileStream;
            this.offset = offset;
            this.length = length;
        }
    
        public void CopyTo(Stream destination)
        {
            // Note: Stream.CopyTo(Stream, int) takes a buffer size, not a byte
            // count, so it cannot copy exactly `length` bytes. Reuse Read() to
            // copy just this entry.
            var data = Read();
            destination.Write(data, 0, data.Length);
        }
    
        public byte[] Read()
        {
            fileStream.Position = offset;
            var result = new byte[length];
            var totalRead = 0;
            // Stream.Read may return fewer bytes than requested, so keep
            // reading until the entire entry is in the buffer.
            while (totalRead < length)
            {
                var bytesRead = fileStream.Read(result, totalRead, length - totalRead);
                if (bytesRead == 0) throw new InvalidOperationException("Invalid binary format");
                totalRead += bytesRead;
            }
            return result;
        }
    
        public int Length => length;
    }
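
    Used together it might look like the following sketch, assuming the file contains an entry under the placeholder key "some-key".

    using System;
    using System.IO;
    
    using var source = File.OpenRead("records.bin"); // placeholder file name
    var index = Index.Load(source);
    
    // Only the index structure lives in memory; entry bytes stay on disk
    // until they are actually requested.
    var oldData = index.Read("some-key");
    Console.WriteLine($"old length: {oldData.Length}");
    
    // Updates are kept in memory as MemoryEntry instances.
    index.UpdateOrAdd("some-key", new byte[] { 1, 2, 3 });
    
    // Save to a new file so the remaining FileEntry instances can still
    // copy their bytes from the original stream.
    using var destination = File.Create("records.new.bin");
    index.Save(destination);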