file.write() sometimes (but not always) writing text to file


I was using file.write() to add numerical data to a text file. However, after 516159 characters, something interesting happens: about half of the time I run my code, it drops the last 7k characters. The other half, it works fine. Here is some code:

#Create or open file (it strangely couldn't create the file without using mode='x')
try:
  corpus_txt = open("corpus.txt", mode = "x")
except:
  corpus_txt = open("corpus.txt", mode = "w")

corpus_txt.truncate(0)#delete contents

content_length = 0

#X_train is a 2D array of integers
for sentence in X_train:
  for word in sentence:

    corpus_txt.write(str(word)+" ")
    content_length += len(str(word)+" ")

  corpus_txt.write("\n")
  content_length += 1

corpus_txt = open("corpus.txt")
content = corpus_txt.read()
corpus_txt.close()

print("FILE LENGTH (chars):", len(content))
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)

When I run this repeatedly with my data:

  • "content_length" always equals 523379
  • len(content) alternates between the values 516247 and 523379

Some other information:

  • The missing text occurs at the end of the data (the last 7k characters)
  • It's not the increment of content_length at the newline
  • My data is not altered during this code process
  • I am using Google Colab
  • I get 516k slightly more often than 523k
  • There's no particular pattern for the switches
  • It shouldn't be something about the formatting of the read() method because, once again, it's only the last 7k characters that are missing

I would greatly appreciate any help/explanation here. Thanks!


Solution

  • You need to close() the file after you have finished writing to it. Writes are buffered in memory, so until the file is closed (or flushed) the tail of your data is not guaranteed to have reached the file, and a subsequent open() won't "see" it. That is consistent with losing only the last few thousand characters. Using the context manager syntax (with open(...) as ...:) is considered best practice precisely because it closes the file for you and makes this kind of mistake almost impossible.

    This ought to work:

    with open("corpus.txt", mode="w") as corpus_txt:
    
        # opening with "w" automatically overwrites previous contents
        content_length = 0
    
        #X_train is a 2D array of integers
        for sentence in X_train:
            for word in sentence:
                corpus_txt.write(str(word)+" ")
                content_length += len(str(word)+" ")
            corpus_txt.write("\n")
            content_length += 1
    
    with open("corpus.txt") as corpus_txt:
        content = corpus_txt.read()
    
    print("FILE LENGTH (chars):", len(content))
    print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)
    

    Unrelated to the file-writing issue: since the data is clearly small enough to fit in memory, you could build the content as a single string up front, which removes the extra bookkeeping needed to track its length:

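    with open("corpus.txt", mode="w") as corpus_txt:
        # one line per sentence, words separated by single spaces
        # (unlike the loop above, this leaves no trailing space on each line,
        # so the total length will differ slightly from the original count)
        content = "\n".join(
            " ".join(str(word) for word in sentence)
            for sentence in X_train
        ) + "\n"
        corpus_txt.write(content)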
    print(f"File length as written: {len(content)}")
    
    with open("corpus.txt") as corpus_txt:
        content = corpus_txt.read()
    print(f"File length as read: {len(content)}")
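
To see the buffering described above in isolation, here is a minimal sketch. It assumes CPython's default buffering for files opened in text mode. The helper name sizes_before_and_after_flush and the file name demo.txt are made up for illustration. It writes a short string without closing the file, then compares the size on disk before and after an explicit flush():

```python
import os
import tempfile


def sizes_before_and_after_flush(text):
    """Write text without closing the file, and return the on-disk size
    (in bytes) before and after an explicit flush().

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "demo.txt")
        # newline="\n" disables newline translation so sizes match on every OS
        f = open(path, mode="w", encoding="utf-8", newline="\n")
        try:
            f.write(text)                       # goes into an in-memory buffer
            size_before = os.path.getsize(path)  # what a reader would see now
            f.flush()                            # push buffered data to the file
            size_after = os.path.getsize(path)
        finally:
            f.close()
    return size_before, size_after


before, after = sizes_before_and_after_flush("1 2 3\n" * 10)
print("bytes on disk before flush:", before)
print("bytes on disk after flush: ", after)
```

With a short string like this, the whole write should still be sitting in the buffer before flush(), so the first size is expected to be smaller than the second. With a large write, as in the question, most of the data has already been written out and only the tail is still pending. If the file really has to stay open while something else reads it, calling flush() before the read is the usual alternative to closing it.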