I was using file.write() to add numerical data to a text file. However, after 516159 characters, something interesting happens: about half of the time I run my code, it drops the last 7k characters. The other half, it works fine. Here is some code:
# Create or open file (it strangely couldn't create the file without using mode='x')
try:
    corpus_txt = open("corpus.txt", mode = "x")
except:
    corpus_txt = open("corpus.txt", mode = "w")
    corpus_txt.truncate(0)  # delete contents
content_length = 0
# X_train is a 2D array of integers
for sentence in X_train:
    for word in sentence:
        corpus_txt.write(str(word)+" ")
        content_length += len(str(word)+" ")
    corpus_txt.write("\n")
    content_length += 1
corpus_txt = open("corpus.txt")
content = corpus_txt.read()
corpus_txt.close()
print("FILE LENGTH (chars):", len(content))
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)
When I run this repeatedly with my data:
Some other information:
I would greatly appreciate any help/explanation here. Thanks!
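You need to close() the file after you're finished writing to it; otherwise it's not guaranteed to be flushed to disk, and a subsequent open() won't "see" the writes you did. Writes are buffered in memory (the buffer is typically 4096 or 8192 bytes), which would explain why only the last few thousand characters go missing: they are the part still sitting in the buffer. Using the context manager syntax (with open(...) as ...:) is considered best practice precisely because it makes it almost impossible to make this kind of mistake.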
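To see the buffering for yourself, here is a minimal sketch. The helper name visible_sizes and the temporary file are just for illustration (they are not from your code); it compares the file's size on disk before and after close():

```python
import os
import tempfile


def visible_sizes(text):
    """Write text to a temporary file; return its on-disk size before and after close()."""
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    f = open(path, mode="w")
    f.write(text)
    # The text is probably still in the in-memory buffer at this point,
    # so the size on disk is usually smaller than len(text) (often 0).
    before = os.path.getsize(path)
    f.close()  # close() flushes the buffer to disk
    after = os.path.getsize(path)
    return before, after


before, after = visible_sizes("1 2 3 ")
print("size before close():", before)
print("size after close():", after)
```

If you really need to keep the file open while something else reads it, calling flush() on the file object pushes the buffered text out without closing it.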
This ought to work:
with open("corpus.txt", mode="w") as corpus_txt:
    # opening with "w" automatically overwrites previous contents
    content_length = 0
    # X_train is a 2D array of integers
    for sentence in X_train:
        for word in sentence:
            corpus_txt.write(str(word)+" ")
            content_length += len(str(word)+" ")
        corpus_txt.write("\n")
        content_length += 1
with open("corpus.txt") as corpus_txt:
    content = corpus_txt.read()
print("FILE LENGTH (chars):", len(content))
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)
Unrelated to the file-writing issue: I'd suggest building content as a single string up front (it's clearly small enough to fit in memory), so you don't need the extra bookkeeping to track its length:
with open("corpus.txt", mode="w") as corpus_txt:
    content = "\n".join(
        " ".join(str(word) for word in sentence)
        for sentence in X_train
    ) + "\n"
    corpus_txt.write(content)
print(f"File length as written: {len(content)}")
with open("corpus.txt") as corpus_txt:
    content = corpus_txt.read()
print(f"File length as read: {len(content)}")
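One thing to be aware of: the join-based version is not byte-for-byte identical to the original loop, because the loop writes a space after every word (including the last one on each line) and join does not. A small sketch with made-up sample data (this X_train is hypothetical, not your real array) shows the difference:

```python
# Hypothetical sample data standing in for the real X_train.
X_train = [[1, 2, 3], [40, 50]]

# Same text the original nested loop writes: a space after every word.
loop_parts = []
for sentence in X_train:
    for word in sentence:
        loop_parts.append(str(word) + " ")
    loop_parts.append("\n")
loop_content = "".join(loop_parts)

# Text produced by the join-based version: spaces only between words.
join_content = "\n".join(
    " ".join(str(word) for word in sentence)
    for sentence in X_train
) + "\n"

print(repr(loop_content))  # '1 2 3 \n40 50 \n'
print(repr(join_content))  # '1 2 3\n40 50\n'
print(len(loop_content) - len(join_content))  # one extra space per non-empty sentence: 2
```

If whatever reads corpus.txt splits each line on whitespace, the trailing space makes no difference; if it depends on the exact character count, keep the version that matches what you need.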