file.write() sometimes (but not always) writing text to file

I was using file.write() to add numerical data to a text file. However, after 516159 characters, something interesting happens: about half of the time I run my code, it drops the last 7k characters. The other half, it works fine. Here is some code:

#Create or open file (it strangely couldn't create the file without using mode='x')
try:
  corpus_txt = open("corpus.txt", mode = "x")
except:
  corpus_txt = open("corpus.txt", mode = "w")

corpus_txt.truncate(0)#delete contents

content_length = 0

#X_train is a 2D array of integers
for sentence in X_train:
  for word in sentence:

    corpus_txt.write(str(word)+" ")
    content_length += len(str(word)+" ")

  corpus_txt.write("\n")
  content_length += 1

corpus_txt = open("corpus.txt")
content = corpus_txt.read()
corpus_txt.close()

print("FILE LENGTH (chars):", len(content))
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)

When I run this repeatedly with my data:

content_length always equals 523379
len(content) alternates between the values 516247 and 523379

Some other information:

The missing text occurs at the end of the data (the last 7k characters)
It's not the increment of content_length at the newline
My data is not altered during this code process
I am using Google Colab
I get 516k slightly more often than 523k
There's no particular pattern for the switches
It shouldn't be something about the formatting of the read() method because, once again, it's only the last 7k characters that are missing

I would greatly appreciate any help/explanation here. Thanks!

Solution

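You need to close() the file after you're finished writing to it. Writes go into an in-memory buffer first, and until the file is closed (or flushed) the tail end of your data may still be sitting in that buffer rather than on disk, so a subsequent open() won't "see" it. That is why only the last few thousand characters go missing while everything before them is intact. Using the context manager syntax (with open(...) as ...:) is considered best practice precisely because it closes the file for you and makes this kind of mistake almost impossible.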

This ought to work:

with open("corpus.txt", mode="w") as corpus_txt:
    # mode "w" creates the file if it doesn't exist and truncates any previous
    # contents, so the mode="x" fallback and truncate(0) aren't needed
    content_length = 0

    #X_train is a 2D array of integers
    for sentence in X_train:
        for word in sentence:
            corpus_txt.write(str(word)+" ")
            content_length += len(str(word)+" ")
        corpus_txt.write("\n")
        content_length += 1

with open("corpus.txt") as corpus_txt:
    content = corpus_txt.read()

print("FILE LENGTH (chars):", len(content))
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)
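
If you want to see the buffering effect in isolation, here is a minimal sketch. The scratch path and variable names are made up for the demonstration, and the exact behaviour before the flush depends on the Python implementation and buffer size; in CPython a write this small normally stays in memory until flush() or close() is called.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")

writer = open(path, mode="w")
writer.write("1 2 3\n")  # small write: held in Python's in-memory buffer

# A second handle reads whatever has actually reached the file so far.
with open(path) as reader:
    seen_before_flush = reader.read()

writer.flush()  # push the buffered text out to the operating system

with open(path) as reader:
    seen_after_flush = reader.read()

writer.close()

print("before flush:", repr(seen_before_flush))
print("after flush: ", repr(seen_after_flush))
```

With a large file the same thing happens, except that only the final, partially filled buffer is left behind, which would match losing just the tail of the data.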

Unrelated to the file writing issue: I'd suggest building the content as a single string up front (it's clearly small enough to fit in memory), so you don't need the extra bookkeeping to work out how long it is. Note that this version puts spaces only between words, so each non-empty line comes out one character shorter than in your loop:

with open("corpus.txt", mode="w") as corpus_txt:
    content = "\n".join(
        " ".join(str(word) for word in sentence)
        for sentence in X_train
    ) + "\n"
    corpus_txt.write(content)
print(f"File length as written: {len(content)}")

with open("corpus.txt") as corpus_txt:
    content = corpus_txt.read()
print(f"File length as read: {len(content)}")
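
As a quick check of how the join-based format compares with the original loop, here is a small sketch using a made-up stand-in for X_train (the values are hypothetical). The loop writes a space after every word, while join only puts spaces between words, so the lengths differ by one character per non-empty sentence.

```python
# Hypothetical stand-in for X_train: a small 2D list of integers.
X_train = [[1, 2, 3], [40, 5]]

# Loop-style output: every word is followed by a space, then a newline.
loop_text = "".join(
    "".join(str(word) + " " for word in sentence) + "\n"
    for sentence in X_train
)

# Join-style output: spaces only between words, newline after each sentence.
join_text = "\n".join(
    " ".join(str(word) for word in sentence)
    for sentence in X_train
) + "\n"

print(repr(loop_text))
print(repr(join_text))
print(len(loop_text), len(join_text))
```

Both forms split back into the same tokens with str.split(), so the difference only matters if you compare character counts between the two versions.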