Tags: python, parsing, readlines

Why is readlines() reading much more than the sizehint?


Background

I am parsing very large text files (30GB+) in Python 2.7.6. To speed the process up a bit, I am splitting the files into chunks and farming them out to subprocesses using the multiprocessing library. To do this, I am iterating over the file in my main process, recording byte positions where I want to split the input file and passing those byte positions to the subprocesses, which then open the input file and read in their block using file.readlines(chunk_size). However, I'm finding that the chunks that are read in seem to be much larger (4x) than the sizehint argument.
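
For context, the farm-out looks roughly like this (a minimal sketch rather than my actual code: parse_chunk and the hard-coded filename are placeholders, and chunk_list stands for the (start, end, number) list built by the chunking loop in the example code below):

import multiprocessing

def parse_chunk(args):
    # placeholder worker: re-open the input and read only this byte range
    chunk_start, chunk_end, chunk_num = args
    with open('test.txt', 'r') as fi:
        fi.seek(chunk_start)
        lines = fi.readlines(chunk_end - chunk_start)  # sizehint in bytes
    # ... parse the lines here ...
    return chunk_num

if __name__ == '__main__':
    # chunk_list comes from the chunking loop below; sample values shown
    chunk_list = [(0, 2052, 1), (2052, 4103, 2)]
    pool = multiprocessing.Pool()
    pool.map(parse_chunk, chunk_list)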

The Question

Why isn't the sizehint being heeded?

Example Code

The following code demonstrates my issue:

import sys

# set test chunk size to 2KB
chunk_size = 1024 * 2

count = 0
chunk_start = 0
chunk_list = []

fi = open('test.txt', 'r')
while True:
    # increment chunk counter
    count += 1

    # calculate new chunk end, advance file pointer
    chunk_end = chunk_start + chunk_size
    fi.seek(chunk_end)

    # advance file pointer to the end of the current line so chunks
    # don't contain broken lines
    fi.readline()
    chunk_end = fi.tell()

    # record chunk start and stop positions, chunk number
    chunk_list.append((chunk_start, chunk_end, count))

    # advance start to current end
    chunk_start = chunk_end

    # read a line to confirm we're not past the end of the file
    line = fi.readline()
    if not line:
        break

    # reset file pointer from last line read
    fi.seek(chunk_end, 0)

fi.close()

# This code represents the action taken by subprocesses, but each subprocess
# receives one chunk instead of iterating the list of chunks itself.
with open('test.txt', 'r', 0) as fi:
    # iterate over chunks
    for chunk in chunk_list:
        chunk_start, chunk_end, chunk_num = chunk

        # advance file pointer to chunk start
        fi.seek(chunk_start, 0)

        # print some notes and read in the chunk
        sys.stdout.write("Chunk #{0}: Size: {1} Start {2} Real Start: {3} Stop {4} "
              .format(chunk_num, chunk_end-chunk_start, chunk_start, fi.tell(), chunk_end))
        chunk = fi.readlines(chunk_end - chunk_start)
        print("Real Stop: {0}".format(fi.tell()))

        # write the chunk out to a file for examination
        with open('test_chunk{0}'.format(chunk_num), 'w') as fo:
            fo.writelines(chunk)

Results

I ran this code with an input file (test.txt) of about 23.3 KB and it produced the following output:

Chunk #1: Size: 2052 Start 0 Real Start: 0 Stop 2052 Real Stop: 8193
Chunk #2: Size: 2051 Start 2052 Real Start: 2052 Stop 4103 Real Stop: 10248
Chunk #3: Size: 2050 Start 4103 Real Start: 4103 Stop 6153 Real Stop: 12298
Chunk #4: Size: 2050 Start 6153 Real Start: 6153 Stop 8203 Real Stop: 14348
Chunk #5: Size: 2050 Start 8203 Real Start: 8203 Stop 10253 Real Stop: 16398
Chunk #6: Size: 2050 Start 10253 Real Start: 10253 Stop 12303 Real Stop: 18448
Chunk #7: Size: 2050 Start 12303 Real Start: 12303 Stop 14353 Real Stop: 20498
Chunk #8: Size: 2050 Start 14353 Real Start: 14353 Stop 16403 Real Stop: 22548
Chunk #9: Size: 2050 Start 16403 Real Start: 16403 Stop 18453 Real Stop: 23893
Chunk #10: Size: 2050 Start 18453 Real Start: 18453 Stop 20503 Real Stop: 23893
Chunk #11: Size: 2050 Start 20503 Real Start: 20503 Stop 22553 Real Stop: 23893
Chunk #12: Size: 2048 Start 22553 Real Start: 22553 Stop 24601 Real Stop: 23893

Each of the chunk sizes reported is ~2 KB, all of the start/stop positions line up the way they should, and the real file position reported by fi.tell() seems to be correct, so I'm fairly certain my chunking algorithm is good. However, the real stop locations show that readlines() is reading much more than the size hint. Also, output files #1 - #8 are 8.0 KB, which is much larger than the size hint.

Even if my attempt to break the chunks only on line ends were wrong, readlines() still shouldn't have to read more than 2 KB + one line. Files #9 - #12 get increasingly smaller, which makes sense since the chunk starting points get closer and closer to the end of the file, and readlines() won't read past the end of the file.
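
The overshoot is easy to reproduce in isolation. A minimal sketch against the same test.txt, asking for a ~2 KB sizehint:

with open('test.txt', 'r') as fi:
    lines = fi.readlines(2048)  # sizehint of ~2 KB
    total = sum(len(line) for line in lines)
    # prints ~8193 bytes with the 23.3 KB test file, not ~2048
    print("{0} lines, {1} bytes".format(len(lines), total))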

Notes

  1. My test input file simply has "<line number>\n" printed on each line, for line numbers 1-5000.
  2. I tried again with different chunk and input file sizes with similar results.
  3. The readlines documentation says that read sizes may be rounded up to the size of an internal buffer, so I've tried opening the files without buffering (as shown) and it made no difference.
  4. I am using this algorithm to split the file because I need to be able to support *.bz2 and *.gz compressed files, and *.gz files give me no way to identify the uncompressed file size without decompressing the file. *.bz2 files don't either, but I could seek 0 bytes from the end of those and use fi.tell() to get the file size (see the sketch after this list). See my related question.
  5. Before the requirement to support compressed files was added, the previous version of the script used os.path.getsize() as a stopping condition on the chunking loop, and readlines seemed to work just fine with that method.
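
For reference, the extension-based dispatch from note 4 looks roughly like this (a sketch: open_input is a hypothetical helper name, and the bz2 trick is the one described above; note that the seek has to decompress the whole stream to reach the end):

import bz2
import gzip

def open_input(path):
    # hypothetical helper: pick a file-like object by extension so the
    # chunking code can treat plain and compressed inputs uniformly
    if path.endswith('.bz2'):
        return bz2.BZ2File(path, 'r')
    if path.endswith('.gz'):
        return gzip.GzipFile(path, 'r')
    return open(path, 'r')

# bz2 (but not gzip) supports seeking from the end, so the uncompressed
# size can be recovered with tell()
fi = bz2.BZ2File('input.bz2', 'r')
fi.seek(0, 2)
size = fi.tell()
fi.close()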

Solution

  • The buffer the readlines documentation mentions isn't related to the buffering that the third argument of the open call controls. The buffer is this buffer in file_readlines (from CPython 2.7's Objects/fileobject.c):

    static PyObject *
    file_readlines(PyFileObject *f, PyObject *args)
    {
        long sizehint = 0;
        PyObject *list = NULL;
        PyObject *line;
        char small_buffer[SMALLCHUNK];
    

    where SMALLCHUNK is defined earlier:

    #if BUFSIZ < 8192
    #define SMALLCHUNK 8192
    #else
    #define SMALLCHUNK BUFSIZ
    #endif
    

    BUFSIZ is the C library's default stdio buffer size, defined in <stdio.h>; it's evidently no larger than 8192 on your platform, so you're getting the #define SMALLCHUNK 8192 case. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.
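
    If the ~2 KB chunks matter, one way to sidestep readlines' internal buffer entirely is to read the exact byte range with read(), which does honor its size argument, and split the lines yourself. A minimal sketch, relying on your chunking loop having already aligned chunk_end to a line break:

    def read_chunk(fi, chunk_start, chunk_end):
        # read() returns exactly the requested number of bytes (short only
        # at EOF); splitlines(True) keeps the line endings, so the result
        # matches what readlines() would produce for '\n'-terminated lines
        fi.seek(chunk_start)
        return fi.read(chunk_end - chunk_start).splitlines(True)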