I am parsing very large text files (30GB+) in Python 2.7.6. To speed the process up a bit, I am splitting the files into chunks and farming them out to subprocesses using the multiprocessing library. To do this, I am iterating over the file in my main process, recording byte positions where I want to split the input file, and passing those byte positions to the subprocesses, which then open the input file and read in their block using file.readlines(chunk_size). However, I'm finding that the chunks that are read in seem to be much larger (4x) than the sizehint argument.
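For context, the farm-out looks roughly like this. It's a minimal sketch, not my actual code: process_chunk is a hypothetical worker name, and the chunk_list literal stands in for the (start, end, number) tuples built by the chunking loop shown below.

import multiprocessing

def process_chunk(args):
    # hypothetical worker: seek to the chunk start and read roughly
    # (end - start) bytes' worth of whole lines
    chunk_start, chunk_end, chunk_num = args
    with open('test.txt', 'r') as fi:
        fi.seek(chunk_start)
        lines = fi.readlines(chunk_end - chunk_start)
    # ... parse lines here ...
    return chunk_num, len(lines)

if __name__ == '__main__':
    # stand-in chunk boundaries; the real list comes from the loop below
    chunk_list = [(0, 2052, 1), (2052, 4103, 2)]
    pool = multiprocessing.Pool()
    results = pool.map(process_chunk, chunk_list)
    pool.close()
    pool.join()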
Why isn't the sizehint being heeded?
The following code demonstrates my issue:
import sys

# set test chunk size to 2KB
chunk_size = 1024 * 2
count = 0
chunk_start = 0
chunk_list = []

fi = open('test.txt', 'r')
while True:
    # increment chunk counter
    count += 1
    # calculate new chunk end, advance file pointer
    chunk_end = chunk_start + chunk_size
    fi.seek(chunk_end)
    # advance file pointer to end of current line so chunks don't have broken
    # lines
    fi.readline()
    chunk_end = fi.tell()
    # record chunk start and stop positions, chunk number
    chunk_list.append((chunk_start, chunk_end, count))
    # advance start to current end
    chunk_start = chunk_end
    # read a line to confirm we're not past the end of the file
    line = fi.readline()
    if not line:
        break
    # reset file pointer from last line read
    fi.seek(chunk_end, 0)
fi.close()
# This code represents the action taken by subprocesses, but each subprocess
# receives one chunk instead of iterating the list of chunks itself.
with open('test.txt', 'r', 0) as fi:
    # iterate over chunks
    for chunk in chunk_list:
        chunk_start, chunk_end, chunk_num = chunk
        # advance file pointer to chunk start
        fi.seek(chunk_start, 0)
        # print some notes and read in the chunk
        sys.stdout.write("Chunk #{0}: Size: {1} Start {2} Real Start: {3} Stop {4} "
                         .format(chunk_num, chunk_end - chunk_start, chunk_start,
                                 fi.tell(), chunk_end))
        chunk = fi.readlines(chunk_end - chunk_start)
        print("Real Stop: {0}".format(fi.tell()))
        # write the chunk out to a file for examination
        with open('test_chunk{0}'.format(chunk_num), 'w') as fo:
            fo.writelines(chunk)
I ran this code with an input file (test.txt) of about 23.3KB and it produced the following output:
Chunk #1: Size: 2052 Start 0 Real Start: 0 Stop 2052 Real Stop: 8193
Chunk #2: Size: 2051 Start 2052 Real Start: 2052 Stop 4103 Real Stop: 10248
Chunk #3: Size: 2050 Start 4103 Real Start: 4103 Stop 6153 Real Stop: 12298
Chunk #4: Size: 2050 Start 6153 Real Start: 6153 Stop 8203 Real Stop: 14348
Chunk #5: Size: 2050 Start 8203 Real Start: 8203 Stop 10253 Real Stop: 16398
Chunk #6: Size: 2050 Start 10253 Real Start: 10253 Stop 12303 Real Stop: 18448
Chunk #7: Size: 2050 Start 12303 Real Start: 12303 Stop 14353 Real Stop: 20498
Chunk #8: Size: 2050 Start 14353 Real Start: 14353 Stop 16403 Real Stop: 22548
Chunk #9: Size: 2050 Start 16403 Real Start: 16403 Stop 18453 Real Stop: 23893
Chunk #10: Size: 2050 Start 18453 Real Start: 18453 Stop 20503 Real Stop: 23893
Chunk #11: Size: 2050 Start 20503 Real Start: 20503 Stop 22553 Real Stop: 23893
Chunk #12: Size: 2048 Start 22553 Real Start: 22553 Stop 24601 Real Stop: 23893
Each of the chunk sizes reported is ~2KB, all of the start/stop positions line up the way they should, and the real file positions reported by fi.tell() seem to be correct, so I'm fairly certain my chunking algorithm is good. However, the real stop locations show that readlines() is reading much more than the size hint. Also, output files #1-#8 are 8.0KB, which is much larger than the size hint.

Even if my attempts to only break the chunks on line ends were wrong, readlines() still shouldn't have to read more than 2KB plus one line. Files #9-#12 get increasingly smaller, which makes sense since the chunk starting points get closer and closer to the end of the file, and readlines() won't read past the end of the file.
Update: I had been using fi.tell() to get the file size (see my related question). I switched to using os.path.getsize() as a stopping condition on the chunking loop, and readlines seemed to work just fine with that method.
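For reference, the changed stopping condition looks roughly like this; a sketch, reusing the same test.txt and variables as the code above:

import os

# get the file size from the OS instead of from fi.tell()
file_size = os.path.getsize('test.txt')
chunk_size = 1024 * 2
chunk_start = 0
count = 0
chunk_list = []

with open('test.txt', 'r') as fi:
    while chunk_start < file_size:
        count += 1
        # seek past the nominal chunk end, then extend to the next newline
        fi.seek(chunk_start + chunk_size)
        fi.readline()
        # cap the end at the real file size rather than trusting tell()
        chunk_end = min(fi.tell(), file_size)
        chunk_list.append((chunk_start, chunk_end, count))
        chunk_start = chunk_end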
The buffer that the readlines documentation mentions isn't related to the buffering that the third argument of the open call controls. The buffer is this buffer in file_readlines:
static PyObject *
file_readlines(PyFileObject *f, PyObject *args)
{
    long sizehint = 0;
    PyObject *list = NULL;
    PyObject *line;
    char small_buffer[SMALLCHUNK];
where SMALLCHUNK is defined earlier:
#if BUFSIZ < 8192
#define SMALLCHUNK 8192
#else
#define SMALLCHUNK BUFSIZ
#endif
BUFSIZ is the standard I/O buffer size from the platform's stdio.h, and it looks like you're getting the #define SMALLCHUNK 8192 case. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.
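You can see the floor directly, and sidestep it, with something like the following; a sketch that assumes, as in your chunking loop, that every (start, end) pair already falls on a line boundary:

with open('test.txt', 'r') as fi:
    # even a 1-byte sizehint pulls in a full internal buffer of lines
    lines = fi.readlines(1)
    # at least 8192 here, since test.txt is larger than the buffer
    print(sum(len(line) for line in lines))

# workaround: read exactly chunk_end - chunk_start bytes and split them
# yourself, instead of relying on the readlines sizehint
with open('test.txt', 'r') as fi:
    for chunk_start, chunk_end, chunk_num in chunk_list:
        fi.seek(chunk_start)
        data = fi.read(chunk_end - chunk_start)
        lines = data.splitlines(True)  # True keeps line endings, like readlines

Since read(n) returns at most n bytes, this honors your chunk boundaries exactly, at the cost of doing the line splitting in Python rather than in the C buffer loop.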