I am trying to understand file IO in Python, with the different modes for open, and reading and writing to the same file object (just self learning).
I was surprised by the following code (which was just me exploring):
with open('blah.txt', 'w') as f:
    # create a file with 13 characters
    f.write("hello, world!")

with open('blah.txt', 'r+') as f:
    for i in range(5):
        # write one character
        f.write(str(i))
        # then print current position and next 3 characters
        print(f"{f.tell()}: ", f"{f.read(3)}")

with open('blah.txt', 'r') as f:
    # look at how the file was modified
    print(f.read())
Which output:
1: ell
14:
15:
16:
17:
0ello, world!1234
As I expected, the first character was overwritten by `0`, then the next 3 characters read were `ell`, but I expected the `1` to be written over the `o` in `hello`, then the next 3 characters read to be `, w`.
I'm reading the docs here, but I don't see where it explains the behavior that I observed.
It appears that the first read, no matter what the size, seeks to the end of the file.
Can anyone provide a link to where it explains this in the docs?
I tried searching for a similar question on this site, but while there were many questions related to read, none that I found mentioned this behavior.
UPDATE
After more exploration, it is not the first `read` that seeks to the end of the file, but rather the second `write` that does. Again, I'm not sure why, which is why I'm hoping to find somewhere in the docs that explains this behavior.
Here's my change to the code above that shows that it's not the first read:
with open('blah.txt', 'w') as f:
    # create a file with 13 characters
    f.write("hello, world!")

with open('blah.txt', 'r+') as f:
    for i in range(3):
        # write one character
        f.write(str(i))
        # then print current position and next 3 characters
        print(f"{f.tell()}: ", f"{f.read(3)}")
        print(f"{f.tell()}: ", f"{f.read(3)}")

with open('blah.txt', 'r') as f:
    # look at how the file was modified
    print(f.read())
Which output:
1: ell
4: o,
14:
14:
15:
15:
0ello, world!12
Consider this example:
with open('test.txt', 'w') as f:
    f.write('HelloEmpty')

with open('test.txt', 'r+') as f:
    print(f.read(5))
    print(f.write('World'))
    f.flush()
    f.seek(0)
    print(f.read(10))
You might expect this to print:
Hello
5
HelloWorld
Instead, it prints:
Hello
5
HelloEmpty
And the file contains 'HelloEmptyWorld' after execution.
Even though this code:
with open('test.txt', 'w') as f:
    f.write('HelloEmpty')

with open('test.txt', 'r+') as f:
    print(f.read(5))
    print(f.read(5))
Works as expected and prints:
Hello
Empty
So the `.read()` doesn't leave the visible read position at the end of the file; otherwise the second print statement would have caused an error or come up empty.
However, consider this example:
with open('test.txt', 'w') as f:
    for _ in range(10000):
        f.write('HelloEmpty')

with open('test.txt', 'r+') as f:
    print(f.read(5))
    print(f.write('World'))
If you execute this code, and then look at the file, you will find that at position 8193, the word 'World' has been written.
So it appears that Python reads the text data in chunks of 8192 bytes or characters, and although consecutive calls to `.read()` track the position in the read buffer, calls to `.write()` use the actual file pointer, which has been moved 8k ahead (or to the end of the file, whichever comes first).
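To make the gap between the two positions visible, here is a minimal sketch (my own illustration, not something from the docs). It assumes CPython, where a text file object exposes its binary layer as `f.buffer`. The helper name `positions_after_small_read` and the temporary file are made up for the example, and I pass `encoding='utf-8'` explicitly so the result doesn't depend on the platform default.

```python
import os
import tempfile


def positions_after_small_read(path):
    """Return (text-layer position, binary-layer position) after read(5)."""
    with open(path, 'w', encoding='utf-8') as f:
        # 100000 bytes, so the file is larger than a single read chunk
        f.write('HelloEmpty' * 10000)

    with open(path, 'r+', encoding='utf-8') as f:
        f.read(5)
        logical = f.tell()            # where the next .read() would continue
        underlying = f.buffer.tell()  # where the binary layer currently sits
    return logical, underlying


# Hypothetical scratch file; any writable path would do.
with tempfile.TemporaryDirectory() as tmp:
    logical, underlying = positions_after_small_read(os.path.join(tmp, 'test.txt'))
    print(logical, underlying)
```

The first number is where the next `.read()` would continue. The second, larger number is where the binary layer sits, which, as far as I can tell, is where a `.write()` issued right after the read ends up.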
Whether it's characters or bytes, you can tell from this:
with open('test.txt', 'w', encoding='utf16') as f:
    for _ in range(10000):
        f.write('HelloEmpty')

with open('test.txt', 'r+', encoding='utf16') as f:
    print(f.read(5))
    print(f.write('World'))
Now each character in the file takes 2 bytes, and the word 'World' gets written at position 4097 (in characters, but still after byte 8192), so the buffer size is in bytes.
(Note that 8192 and 4096 are powers of two, in case those numbers seem arbitrary. Also note that the first large file is exactly 100000 bytes in size, as expected, but the second one is 200002 bytes, which causes the word 'World' to be offset a bit, due to the chosen encoding and its byte order mark.)
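If you want to check those two file sizes yourself, a small sketch along these lines should do. The helper `written_size` and the file names are hypothetical, and I use `encoding='utf-8'` for the first file as an assumption (the original example relies on the platform default, which gives the same bytes for this ASCII-only text).

```python
import codecs
import os
import tempfile


def written_size(path, encoding):
    """Write 'HelloEmpty' 10000 times and return the file size in bytes."""
    with open(path, 'w', encoding=encoding) as f:
        for _ in range(10000):
            f.write('HelloEmpty')
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    size_utf8 = written_size(os.path.join(tmp, 'a.txt'), 'utf-8')
    size_utf16 = written_size(os.path.join(tmp, 'b.txt'), 'utf16')

# 10 characters * 10000 repeats = 100000 characters in both files
print(size_utf8)                         # one byte per character
print(size_utf16)                        # two bytes per character, plus the BOM
print(size_utf16 - 2 * size_utf8)        # the extra bytes
print(len(codecs.BOM_UTF16))             # length of a UTF-16 byte order mark
```

The last two numbers should agree, which is where the extra 2 bytes in the second file come from.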
Edit, a bit more information:
Consider this:
with open('test1.txt', 'w') as f:
    f.write('x' * 100000)

with open('test1.txt', 'r+') as f:
    s1 = f.read(5)    # 5. read first 5 bytes, expected to be 'xxxxx'
    f.seek(0)         # 6. seek back to beginning of file
    f.write('y' * 5)  # 7. write 5 bytes of 'yyyyy'
    f.read(5)         # 8. then read another 5 bytes (discarded, but would be 'xxxxx')
    f.flush()         # 9. flush the buffer, if any
    f.seek(0)         # 10. seek the beginning of the file once more
    s2 = f.read(5)    # 11. read 5 characters, expected to be 'yyyyy', but in fact 'xxxxx' (Win 10, Python 3.11.6)
    print(s1, s2)

# without line 8:
with open('test2.txt', 'w') as f:
    f.write('x' * 100000)

with open('test2.txt', 'r+') as f:
    s1 = f.read(5)
    f.seek(0)
    f.write('y' * 5)
    # f.read(5)
    f.flush()
    f.seek(0)
    s2 = f.read(5)
    print(s1, s2)  # the expected result
Output:
xxxxx xxxxx
xxxxx yyyyy
This shows that performing a `.read()` after a `.write()`, before flushing the buffer, can cause very unexpected results. You'll find that `test2.txt` will have `yyyyy` written at the start, as expected, but `test1.txt` will have it written after the first read buffer.
I'm not sure if this shouldn't in fact be considered a bug in Python...
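Whatever the verdict on that, a workaround that is commonly suggested for the read-then-write case is to re-synchronize the file object with an explicit `f.seek(f.tell())` before switching from reading to writing. Below is a minimal sketch of that idea, not an official recipe. The helper `write_after_read`, its `resync` flag, and the temporary file are hypothetical names for illustration, and `encoding='utf-8'` is my own assumption.

```python
import os
import tempfile


def write_after_read(path, resync):
    """Read 5 characters, then write 'World'; return the final file content."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('HelloEmpty')

    with open(path, 'r+', encoding='utf-8') as f:
        f.read(5)
        if resync:
            # Move the underlying file position to where the text layer
            # reports the read position to be, before writing.
            f.seek(f.tell())
        f.write('World')

    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


# Hypothetical scratch file; any writable path would do.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')
    without_resync = write_after_read(path, resync=False)
    with_resync = write_after_read(path, resync=True)

print(repr(without_resync))
print(repr(with_resync))
```

With `resync=True` the write lands right after the five characters that were read, so the file ends up as 'HelloWorld'. Without it I would expect the 'HelloEmptyWorld' content from the first example.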