I am writing a program that needs to write a large number of repeated bytes to a file in Python.
In order to simplify the question, I will be using representations of the data.
buffer is always the file object returned by open(file_name, "ab+") (binary mode, since we're writing bytes), and b"0" will be an arbitrary representation of the "repeated data".
Suppose I want to write the repeated data 50,000 times. There are three ways of doing this that I've thought of so far.
The first way is to generate the data in memory first and then write it all at once, which uses a lot of memory:
data = b"0" * 50_000
buffer.write(data)
The second way is to write each repetition individually in a loop, which is extremely slow:
for _ in range(50_000):
    buffer.write(b"0")
Lastly, the third way is to combine the two and write bigger segments. This is a lot faster than the second option but slower than the first; it also keeps the segment buffers in memory even when they aren't being used, and on top of that the design is extremely ugly:
data_x1 = b"0"
data_x10 = b"0" * 10
data_x100 = b"0" * 100
data_x1000 = b"0" * 1000
# writing in a loop using bigger to smaller segments
# until there is no more to write
num_left = 50_000
while num_left > 0:
    if num_left >= 1000:
        buffer.write(data_x1000)
        num_left -= 1000
    elif num_left >= 100:
        buffer.write(data_x100)
        num_left -= 100
    elif num_left >= 10:
        buffer.write(data_x10)
        num_left -= 10
    else:
        buffer.write(data_x1)
        num_left -= 1
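For completeness, the same segmented idea can be expressed more tidily with divmod and a single block size; this is still a Python-level loop, and write_repeated is just an illustrative name, not an existing API:

def write_repeated(buffer, unit, count, block_units=1024):
    # one reusable block plus a divmod to size the tail
    block = unit * block_units
    full, rem = divmod(count, block_units)
    for _ in range(full):
        buffer.write(block)
    if rem:
        buffer.write(unit * rem)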
TL;DR: the goal is to repeatedly write a specified set of bytes to a file without using a Python loop and without generating the entire sequence in memory first.
I have been looking at BufferedWriter's write() method and noticed that it takes a bytes-like object.
If it is possible, the optimal method would be a "bytes-like" object that simulates a stream of repeated data, which the buffer could then write out N times without a Python loop.
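For reference, the closest approximation I can think of: itertools.repeat(block, n) yields the same block object n times, and in CPython writelines consumes that iterable in C, so no Python-level loop runs and memory stays bounded by a single block. A sketch, with 8192 as an arbitrary block size:

import itertools
unit = b"0"
count = 50_000
block = unit * 8192                               # one reusable block
full, rem = divmod(count, 8192)
buffer.writelines(itertools.repeat(block, full))  # CPython iterates in C, not Python
if rem:
    buffer.write(unit * rem)                      # leftover tail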
Just make the largest block you reasonably can and slice the last write smaller:
# largest block you can put into memory
# replicate block until some limit
# must wrap at end if data isn't all the same
data = b"0" * 50_000
added_length = 120_000
with open(file_name, "ab+") as fh: # bytes mode makes math work
while fh.tell() < added_length:
fh.write(data[:added_length - fh.tell()]) # don't write too much
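If the block is large, note that data[:end - fh.tell()] copies the slice on every iteration; wrapping the block in a memoryview makes those slices zero-copy. A small optional tweak to the loop above:

view = memoryview(data)  # slicing a memoryview doesn't copy the bytes
while fh.tell() < end:
    fh.write(view[:end - fh.tell()])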
Note that with "a" mode the starting position can be surprising: CPython explicitly seeks to the end of the file on open, so .tell() reports the existing file size rather than 0. Recording the offset up front, as above, keeps the math right either way. However, you may find opening "r+" and .seek()-ing to the end nicer if you want to read other content in the same go or get the total file size.
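One last trick, but only when the fill byte is NUL (b"\x00", not the ASCII b"0" used above): growing the file with .truncate() writes nothing at all, and POSIX guarantees the extended region reads back as zeros (often stored sparsely). Behaviour on other platforms is less well defined, so treat this as a platform-specific sketch:

with open(file_name, "ab") as fh:
    fh.truncate(fh.tell() + added_length)  # POSIX zero-fills the extension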