
Python3 open buffering argument looks strange


From the documentation:

buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long. “Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
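The documented policies can be checked directly. The following is a minimal sketch, assuming CPython; the helper name `buffering_policy` and the scratch file are made up for illustration and are not part of the question.

```python
import io
import os
import tempfile


def buffering_policy(mode, buffering):
    """Open a scratch file and report how the buffering argument was applied.

    Hypothetical helper, for illustration only.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, mode, buffering=buffering) as f:
            if isinstance(f, io.TextIOWrapper):
                if f.line_buffering:
                    return 'line-buffered text'
                return 'chunk-buffered text'
            if isinstance(f, io.FileIO):
                # buffering=0 in binary mode returns the raw file object.
                return 'unbuffered binary'
            return 'chunk-buffered binary'
    except ValueError:
        # Text mode refuses buffering=0.
        return 'rejected'
    finally:
        os.unlink(path)


print(buffering_policy('wb', 0))
print(buffering_policy('w', 0))
print(buffering_policy('w', 1))
print(buffering_policy('w', 16))
```

Note that this only reports which policy was selected; it says nothing yet about how many write syscalls a given buffer size produces.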

I open a file named test.log in text mode and set buffering to 16, so I expect the chunk size to be 16 bytes. When I write a 32-byte string to the file, I expect it to call write (the syscall) twice. In fact, write is called only once. (Tested with Python 3.7.2, GCC 8.2.1 20181127, on Linux.)

import os


try:
    os.unlink('test.log')
except Exception:
    pass


with open('test.log', 'a', buffering=16) as f:
    for _ in range(10):
        f.write('a' * 32)

Using strace -e write python3 test.py to trace the syscalls, I get the following:

write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 320) = 320

What does buffering mean?


Solution

  • This answer is valid for CPython 3.7; other implementations of Python may differ.

    The open() function in text mode returns an _io.TextIOWrapper(). The _io.TextIOWrapper() has an internal 'buffer' called pending_bytes with a size of 8192 bytes (it is hard coded), and it also holds a reference to an _io.BufferedWriter() for text modes w and a, or an _io.BufferedRandom() for modes that include + (such as a+). The buffer size of the _io.BufferedWriter()/_io.BufferedRandom() is set by the buffering argument of the open() function.

    When you call _io.TextIOWrapper().write("some text"), the text is added to the internal pending_bytes buffer. After some writes the pending_bytes buffer fills up, and its contents are written into the buffer inside _io.BufferedWriter(). When the buffer inside _io.BufferedWriter() fills up as well, its contents are written to the target file.

    When you open a file in binary mode, you get the _io.BufferedWriter()/_io.BufferedRandom() object directly, initialized with the buffer size from the buffering parameter.
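    The layering described above can be inspected from Python. This is a small sketch, assuming CPython; `layer_names` is a hypothetical helper that follows the public .buffer and .raw attributes down to the raw file object.

```python
import os
import tempfile


def layer_names(mode, buffering=-1):
    """Return the class names of the I/O layers that open() stacks up.

    Hypothetical helper, for illustration only.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, mode, buffering=buffering) as f:
            names = []
            layer = f
            while layer is not None:
                names.append(type(layer).__name__)
                if hasattr(layer, 'buffer'):
                    # Text layer: step down to the buffered layer.
                    layer = layer.buffer
                elif hasattr(layer, 'raw'):
                    # Buffered layer: step down to the raw file.
                    layer = layer.raw
                else:
                    layer = None
            return names
    finally:
        os.unlink(path)


print(layer_names('w', 16))   # text mode: three layers
print(layer_names('wb', 16))  # binary mode: two layers
```

    The buffering argument only sizes the middle (buffered) layer; the text layer on top keeps its own pending data.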

    Let's look at some examples. I will start with the simpler one, which uses binary mode.

    # Case 1
    with open('test.log', 'wb', buffering=16) as f:
        for _ in range(5):
            f.write(b'a'*15)
    

    strace output:

    write(3, "aaaaaaaaaaaaaaa", 15)         = 15
    write(3, "aaaaaaaaaaaaaaa", 15)         = 15
    write(3, "aaaaaaaaaaaaaaa", 15)         = 15
    write(3, "aaaaaaaaaaaaaaa", 15)         = 15
    write(3, "aaaaaaaaaaaaaaa", 15)         = 15
    

    In the first iteration it fills the buffer with 15 bytes. In the second iteration it discovers that adding another 15 bytes would overflow the buffer, so it first flushes the buffer (calls the write syscall) and then stores the new 15 bytes. The same happens in each following iteration. After the last iteration the buffer holds 15 B, which are written when the file is closed (on leaving the with block).

    In the second case, I try to write more data than the buffer can hold:

    # Case 2
    with open('test.log', 'wb', buffering=16) as f:
        for _ in range(5):
            f.write(b'a'*17) 
    

    strace output:

    write(3, "aaaaaaaaaaaaaaaaa", 17)       = 17
    write(3, "aaaaaaaaaaaaaaaaa", 17)       = 17
    write(3, "aaaaaaaaaaaaaaaaa", 17)       = 17
    write(3, "aaaaaaaaaaaaaaaaa", 17)       = 17
    write(3, "aaaaaaaaaaaaaaaaa", 17)       = 17
    

    What happens here is that in the first iteration it tries to put 17 B into the buffer, but the data does not fit, so it is written directly to the file and the buffer stays empty. The same applies to every iteration.
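    Cases 1 and 2 can also be reproduced without strace. The sketch below assumes CPython's C implementation of io.BufferedWriter; `RecordingRaw` and `raw_write_sizes` are hypothetical names, and the raw stream only records the size of each write it receives instead of touching a real file.

```python
import io


class RecordingRaw(io.RawIOBase):
    """Hypothetical raw stream that records the size of every write."""

    def __init__(self):
        super().__init__()
        self.sizes = []

    def writable(self):
        return True

    def write(self, b):
        # Stands in for the write syscall: remember the size, accept it all.
        self.sizes.append(len(b))
        return len(b)


def raw_write_sizes(chunk, count, buffer_size):
    """Write `count` chunks of `chunk` bytes through a BufferedWriter."""
    raw = RecordingRaw()
    with io.BufferedWriter(raw, buffer_size=buffer_size) as f:
        for _ in range(count):
            f.write(b'a' * chunk)
    # Leaving the with block flushes whatever is still buffered.
    return raw.sizes


print(raw_write_sizes(15, 5, 16))  # case 1
print(raw_write_sizes(17, 5, 16))  # case 2
```

    On CPython this should report five writes of 15 bytes for case 1 and five writes of 17 bytes for case 2, matching the strace output above.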

    Now let's look at the text mode.

    # Case 3
    with open('test.log', 'w', buffering=16) as f:
        for _ in range(5):
            f.write('a'*8192)
    

    strace output:

    write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
    write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
    write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192
    

    First recall that pending_bytes has a size of 8192 B. In the first iteration it writes 8192 bytes (from the code: 'a'*8192) into the pending_bytes buffer. In the second iteration it adds another 8192 bytes to pending_bytes, discovers that the total is more than 8192 (the size of the pending_bytes buffer), and writes all 16384 bytes into the underlying _io.BufferedWriter(). The buffer in _io.BufferedWriter() has a size of 16 B (the buffering parameter), so the data is immediately written to the file (same as case 2). Now pending_bytes is empty, and in the third iteration it is again filled with 8192 B. In the fourth iteration another 8192 B is added, pending_bytes overflows, and its contents are again written directly to the file, as in the second iteration. In the last iteration 8192 B is added to pending_bytes, which is flushed when the file is closed.
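    The same mechanism explains the single 320-byte write in the question: ten 32-byte strings never exceed the pending_bytes threshold, so nothing reaches the buffered layer until the file is closed. Below is a sketch assuming CPython; `RecordingRaw` and `text_write_sizes` are hypothetical names, and the raw stream records write sizes instead of writing to a file. It uses the small writes from the question because the exact grouping of large writes, as in case 3, may differ between CPython versions.

```python
import io


class RecordingRaw(io.RawIOBase):
    """Hypothetical raw stream that records the size of every write."""

    def __init__(self):
        super().__init__()
        self.sizes = []

    def writable(self):
        return True

    def write(self, b):
        self.sizes.append(len(b))
        return len(b)


def text_write_sizes(flush_each_write):
    """Repeat the question's loop on top of a 16-byte BufferedWriter."""
    raw = RecordingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=16)
    with io.TextIOWrapper(buffered, encoding='ascii') as f:
        for _ in range(10):
            f.write('a' * 32)
            if flush_each_write:
                # Push pending text down through both layers right away.
                f.flush()
    return raw.sizes


print(text_write_sizes(flush_each_write=False))
print(text_write_sizes(flush_each_write=True))
```

    Without the explicit flush the raw stream should see one write of 320 bytes; with a flush after every write it should see ten writes of 32 bytes.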

    The last example uses a buffering value larger than 8192 B. To make the behavior clearer, I also added two more iterations.

    # Case 4
    with open('test.log', 'w', buffering=30000) as f:
        for _ in range(7):
            f.write('a'*8192)
    

    strace output:

    write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
    write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 16384) = 16384
    write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 24576) = 24576
    

    Iterations:

    1. Add 8192 B to pending_bytes.
    2. Add 8192 B to pending_bytes, but the total is now more than the maximal size, so it is written into the underlying _io.BufferedWriter() and stays there (pending_bytes is empty now).
    3. Add 8192 B to pending_bytes.
    4. Add 8192 B to pending_bytes, but the total is now more than the maximal size, so it tries to write it into the underlying _io.BufferedWriter(). That would exceed the capacity of the underlying buffer, because 16384 + 16384 > 30000 (the first 16384 B are still there from iteration 2), so it first writes the old 16384 B to the file and then puts the new 16384 B (from pending_bytes) into the buffer. (pending_bytes is empty again.)
    5. Same as 3.
    6. Same as 4.
    7. At this point pending_bytes is empty and _io.BufferedWriter() contains 16384 B. This iteration fills pending_bytes with 8192 B. And that's it.

    When the program leaves the with block, it closes the file. Closing proceeds as follows:

    1. Write 8192 B from pending_bytes into _io.BufferedWriter() (this is possible because 8192 + 16384 < 30000).
    2. Write (8192 + 16384 =) 24576 B to the file.
    3. Close the file descriptor.
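    The steps above can be condensed into a small simulation. This is a simplified model of the behavior described in this answer, not CPython's actual code: it only counts bytes, and the names `BufferedModel`, `TextModel` and `simulate` are made up for illustration.

```python
class BufferedModel:
    """Toy model of the buffered layer: counts bytes, records 'syscalls'."""

    def __init__(self, size):
        self.size = size
        self.held = 0          # bytes currently sitting in the buffer
        self.syscalls = []     # sizes that would reach the write syscall

    def write(self, n):
        if n <= self.size - self.held:
            self.held += n     # fits next to what is already buffered
            return
        self.flush()           # make room first
        if n > self.size:
            self.syscalls.append(n)  # too big to buffer: write straight through
        else:
            self.held = n

    def flush(self):
        if self.held:
            self.syscalls.append(self.held)
            self.held = 0


class TextModel:
    """Toy model of the text layer with its 8192-byte pending threshold."""

    def __init__(self, buffered, chunk_size=8192):
        self.buffered = buffered
        self.chunk_size = chunk_size
        self.pending = 0

    def write(self, n):
        self.pending += n
        if self.pending > self.chunk_size:
            self._flush_pending()

    def _flush_pending(self):
        if self.pending:
            self.buffered.write(self.pending)
            self.pending = 0

    def close(self):
        self._flush_pending()
        self.buffered.flush()


def simulate(buffering, chunk, count):
    buffered = BufferedModel(buffering)
    text = TextModel(buffered)
    for _ in range(count):
        text.write(chunk)
    text.close()
    return buffered.syscalls


print(simulate(16, 8192, 5))      # case 3: [16384, 16384, 8192]
print(simulate(30000, 8192, 7))   # case 4: [16384, 16384, 24576]
print(simulate(16, 32, 10))       # the question: [320]
```

    The model reproduces the strace output of cases 3 and 4 and the single 320-byte write from the question, which suggests that the two-level description is at least consistent with the observations on CPython 3.7.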

    By the way, I currently have no idea why pending_bytes exists when the underlying buffer of _io.BufferedWriter() could be used for buffering instead. My best guess is that it improves performance for files opened in text mode.