Tags: python, memory-mapped-files, pyarrow, memory-mapping, apache-arrow

In PyArrow, how to append rows of a table to a memory mapped file?


As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file. I just want to write the file again with the new rows.

import pyarrow as pa

source = pa.memory_map(path, 'r')
table = pa.ipc.RecordBatchFileReader(source).read_all()
schema = pa.ipc.RecordBatchFileReader(source).schema
new_table = create_arrow_table(schema.names) #new table from pydict with same schema and random new values
updated_table = pa.concat_tables([table, new_table], promote=True)   
source.close()
with pa.MemoryMappedFile(path, 'w') as sink:
    with pa.RecordBatchFileWriter(sink, updated_table.schema) as writer:
        writer.write_table(table)

I get an exception stating that the memory-mapped file is closed: ValueError: I/O operation on closed file.

Any suggestions?


Solution

  • Your immediate issue is that you are using pa.MemoryMappedFile(path, 'w') instead of pa.memory_map(path, 'w'). The latter is defined as...

    _check_is_file(path)
    cdef MemoryMappedFile mmap = MemoryMappedFile()
    mmap._open(path, mode)
    return mmap
    

    ...so it should be pretty clear why it was closed: pa.memory_map constructs the MemoryMappedFile and then calls _open on it, whereas instantiating pa.MemoryMappedFile directly skips that _open call, leaving you with a handle that was never opened.

    The next issue you'll run into (assuming it isn't a copy/paste error into SO) is that you are writing table and not updated_table. Easily fixed; a corrected version of the write step is sketched below.
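    Putting those two fixes together, a minimal sketch (path and updated_table are taken from your code above):

    import pyarrow as pa

    # Use pa.memory_map, which actually opens the file, instead of
    # constructing pa.MemoryMappedFile directly.
    with pa.memory_map(path, 'w') as sink:
        with pa.RecordBatchFileWriter(sink, updated_table.schema) as writer:
            # Write the concatenated table, not the original one
            writer.write_table(updated_table)

    This still won't work as-is, though, which brings us to the next problem.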

    The third issue is more problematic. Memory-mapped files have a fixed size and cannot grow naturally in the same way that normal files do. If you try to write your updated table into the same file you will see...

    OSError: Write out of bounds (offset = ..., size = ...) in file of size ...
    

    This problem is not so easily overcome. You could resize the memory map (sink.resize(...)) to some "big enough" size, but then you end up with a file with a bunch of 0's at the end, so you'll need to make sure to shrink it back down after you write. I'm not really sure that's going to give you better performance than writing a regular file. A sketch of that approach follows.
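    A minimal sketch of the grow-then-shrink idea, assuming path and updated_table from above; BIG_ENOUGH is a hypothetical upper bound you'd have to pick yourself:

    import pyarrow as pa

    BIG_ENOUGH = 1 << 20  # assumption: comfortably larger than the serialized table

    with pa.memory_map(path, 'r+') as sink:
        sink.resize(BIG_ENOUGH)  # grow the file so the write fits
        with pa.RecordBatchFileWriter(sink, updated_table.schema) as writer:
            writer.write_table(updated_table)
        sink.resize(sink.tell())  # trim the zero padding back off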

    You could instead write to a bytes object first, then resize the file to match and write those bytes into the memory-mapped file, but that adds some extra bookkeeping, and I don't know the performance impact of resizing the file. Something like the sketch below.
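    A minimal sketch of that variant (again assuming path and updated_table from above); serializing to a BufferOutputStream first means the exact size is known before the file is touched:

    import pyarrow as pa

    # Serialize the updated table to an in-memory buffer first
    buf_stream = pa.BufferOutputStream()
    with pa.RecordBatchFileWriter(buf_stream, updated_table.schema) as writer:
        writer.write_table(updated_table)
    buf = buf_stream.getvalue()

    # Resize the mapped file to exactly fit, then copy the bytes in
    with pa.memory_map(path, 'r+') as sink:
        sink.resize(buf.size)
        sink.write(buf)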