python-3.x, text-files, read-write

python read and write to file adds unexpected chars when script interrupted


Here's my problem: I have a script with many steps. Basically it opens a file, reads it, and after reading it writes back into the same file. All is good when the script completes; problems occur when there is an exception of some sort or the script is interrupted. I open the file in 'r+' mode because if I open it in 'w' mode, the file becomes blank right away, and if the script is then interrupted it stays blank, while I want it to keep the previous value. Below is an example, but not the exact script I am running. If the script is interrupted (or an exception occurs, even if it is handled), the value inside test.txt ends up as "myVar=13e" or "myVar=13ne". Not always, but often. Why does this happen and how do I avoid it?

import time
from test import myVar

file_path = "./test.py"
with open(file_path, 'r+', encoding='utf-8') as f:
    # read the file content, which is for example "myVar=11"
    # do calculations with myVar
    # str_to_oc = "myVar=" + str(row[0])  # row[0] is fetched from the database; it's the record's ID, an integer
    str_to_oc = "myVar=" + str(13)  # I hardcoded the value 13 here instead of the database row[0]
    time.sleep(3)  # just adding a delay so you can interrupt easily
    # write back the string "myVar=13", which is the new value
    f.write(str_to_oc)

Edited the code sample to make it easy to test

One more point: issues like this can also happen because of the default encoding of the system the script is running on. The solution is to always specify the encoding explicitly on both read and write, e.g. encoding='utf-8'.
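A minimal sketch of that advice (the file name here is just an example, not the asker's actual path):

```python
from pathlib import Path

path = Path("test.txt")  # example file for illustration
path.write_text("myVar=11", encoding="utf-8")  # explicit encoding on write
content = path.read_text(encoding="utf-8")     # and on read
```

With the encoding pinned on both sides, the bytes on disk no longer depend on the platform's locale default.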


Solution

  • You are seeing a buffering effect.

    You can reduce the effect by tacking on a flush call:

        f.write(str_to_oc)
        f.flush()
    

    A CTRL/C arrives asynchronously, so this won't fix it entirely. Also, if you choose to insert / delete, so that individual records and the overall file size change, you will be unhappy with how old + new records are misaligned.

Behind the scenes, an io.BufferedWriter is occasionally requesting a raw write, which turns into an OS-level syscall. You say that a CTRL/C or a fatal exception's stack trace causes the program to terminate early. In that case the whole Python interpreter process exits, causing an implicit close(), which can lead to a combination of old and new bytes being read from your file. Note that a multibyte UTF-8 code point can span disk blocks, which may lead to unhappiness.
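For completeness, here is a hedged sketch of flushing all the way to stable storage: `flush()` empties Python's buffer into the OS, and `os.fsync()` asks the OS to commit the bytes to disk (the file name is just an example):

```python
import os

with open("test.txt", "w", encoding="utf-8") as f:
    f.write("myVar=13")
    f.flush()             # empty Python's io buffer into the OS
    os.fsync(f.fileno())  # ask the OS to commit the bytes to disk
```

Even with fsync, a CTRL/C can still land before the write call runs, so this narrows the window rather than closing it, which is why the temp-file approach below is the robust fix.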

    Given the observed reliability of your program, it sounds like you would be well served to leave the original untouched until processing successfully completes:

    import os

    tmp_path = file_path + '.tmp'
    with open(file_path, encoding='utf-8') as fin:
        with open(tmp_path, 'w', encoding='utf-8') as fout:
            for line in fin:
                # (do stuff, compute out_line from line)
                fout.write(out_line + '\n')

    os.rename(tmp_path, file_path)  # atomic operation, all-or-nothing
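One caveat worth sketching: on Windows, os.rename raises if the destination already exists, while os.replace performs the same atomic swap on every platform. A hedged helper combining the ideas above (the name atomic_write is mine, not from the original answer):

```python
import os

def atomic_write(path, data, encoding="utf-8"):
    """Write data via a sibling temp file, so readers see old or new content, never a mix."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding=encoding) as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())    # make sure the bytes hit the disk before the swap
    os.replace(tmp_path, path)  # atomic rename that overwrites on all platforms

atomic_write("test.txt", "myVar=13")
```

If the process dies mid-write, only the `.tmp` file is damaged; the original stays intact until the final, atomic replace.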