Tags: python, python-3.x, io

Appending a stream of repeated bytes to IO in Python


I am writing a program that needs to write a large number of repeated bytes to a file in Python.

In order to simplify the question, I will be using representations of the data: buffer is always the result of open(file_name, "ab+") and b"0" is an arbitrary stand-in for the "repeated data".

Suppose I want to write the repeated data 50,000 times; there are three ways of doing this I've thought of so far.

The first way is to generate the data in memory first, then write it (which uses a lot of memory)

data = b"0" * 50_000
buffer.write(data)

The second way is to write the data one repetition at a time in a loop (which is extremely slow)

for _ in range(50_000):
    buffer.write(b"0")

Lastly, the third way is to combine the two and write larger segments. This is much faster than the second option but slower than the first; it also keeps the segment buffers in memory even when they are not being written, and the overall design is extremely ugly (a tidier variant is sketched after this snippet).

data_x1 = b"0"
data_x10 = b"0" * 10
data_x100 = b"0" * 100
data_x1000 = b"0" * 1000

# writing in a loop using bigger to smaller segments
# until there is no more to write
num_left = 50_000
while num_left > 0:
    if num_left >= 1000:
        buffer.write(data_x1000)
        num_left -= 1000
    elif num_left >= 100:
        buffer.write(data_x100)
        num_left -= 100
    elif num_left >= 10:
        buffer.write(data_x10)
        num_left -= 10
    else:
        buffer.write(data_x1)
        num_left -= 1
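
For reference, a tidier sketch of the same segmented idea, using a single block size and divmod to split off the remainder (the block size of 1000 is arbitrary):

block = b"0" * 1000
full_blocks, remainder = divmod(50_000, len(block))

for _ in range(full_blocks):
    buffer.write(block)              # full-sized writes
if remainder:
    buffer.write(block[:remainder])  # one final short write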

TL;DR: the goal is to repeatedly write a specified set of bytes to a file without using a Python-level loop and without generating the entire sequence in memory first.

I have been looking at BufferedWriter's write() method and noticed that it takes a bytes-like object. If it were possible, the optimal approach would be a "bytes-like" object that simulates a stream of repeated data, which the buffer could then write some number of times without a Python-level loop.
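
The closest thing I have found is not a custom bytes-like object but handing an iterator to writelines(), which accepts any iterable of bytes-like objects and keeps the per-chunk loop inside the C implementation while only one block is held in memory. A minimal sketch:

import itertools

block = b"0" * 1000  # the only copy of the data held in memory
count = 50           # 50 * 1000 = 50,000 bytes in total

# repeat() yields the same block object `count` times;
# writelines() writes each item without a Python-level loop
buffer.writelines(itertools.repeat(block, count))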


Solution

  • Just build the largest block you reasonably can and slice the last write down to size

    # largest block you can reasonably hold in memory;
    # replicate it up to some limit (the block must wrap
    # at the end if the data isn't all the same)
    data = b"0" * 50_000

    added_length = 120_000

    with open(file_name, "ab+") as fh:  # binary mode so the offsets count bytes
        start = fh.tell()  # append mode positions at the current end of the file
        while fh.tell() - start < added_length:
            fh.write(data[:added_length - (fh.tell() - start)])  # don't write too much
    

    Note that in append mode every write goes to the end of the file, and in CPython .tell() reports the absolute position (the file size right after opening), which is why the snippet records start before writing. However, you may find opening "r+" and .seek()ing to the end nicer if you want to read other content in the same go or get the total file size.
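
    A minimal sketch of that "r+" variant (note that "r+" requires the file to already exist, and "b" is still needed so the offsets count bytes):

    import os

    with open(file_name, "r+b") as fh:
        fh.seek(0, os.SEEK_END)  # jump to the current end of the file
        start = fh.tell()        # total file size before appending
        while fh.tell() - start < added_length:
            fh.write(data[:added_length - (fh.tell() - start)])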