I'm trying to generate a dataset from an existing one. I was able to implement a method that randomly changes the contents of the files, but I can't write the result back to a file. I also need to write the number of changed words to the file, since I want to use this dataset to train a neural network. Could you help me?
Input: files with 2 lines of text in each.
Output: files with 3 (maybe) lines: the first line does not change, the second changes according to the method, the third shows the number of words changed (if another layout is better for deep learning tasks, I would be glad of advice, since I'm a beginner).
from random import randrange
import os
Path = "D:\corrected data\\"
filelist = os.listdir(Path)
if __name__ == "__main__":
    new_words = ['consultable', 'partie ', 'celle ', 'également ', 'forte ', 'statistiques ', 'langue ',
                 'cadeaux', 'publications ', 'notre', 'nous', 'pour', 'suivr', 'les', 'vos', 'visitez ', 'thème ', 'thème ', 'thème ', 'produits', 'coulisses ', 'un ', 'atelier ', 'concevoir ', 'personnalisés ', 'consultable', 'découvrir ', 'fournit ', 'trace ', 'dire ', 'tableau', 'décrire', 'grande ', 'feuille ', 'noter ', 'correspondant', 'propre',]
    nb_words_to_replace = randrange(10)
    #with open("1.txt") as file:
    for i in filelist:
        # if i.endswith(".txt"):
        with open(Path + i,"r",encoding="utf-8") as file:
            # for line in file:
            data = file.readlines()
            first_line = data[0]
            second_line = data[1]
            print(f"Original: {second_line}")
            # print(f"FIle: {file}")
            second_line_array = second_line.split(" ")
            for j in range(nb_words_to_replace):
                replacement_position = randrange(len(second_line_array))
                old_word = second_line_array[replacement_position]
                new_word = new_words[randrange(len(new_words))]
                print(f"Position {replacement_position} : {old_word} -> {new_word}")
                second_line_array[replacement_position] = new_word
            res = " ".join(second_line_array)
            print(f"Result: {res}")
        with open(Path + i,"w") as f:
            for line in file:
                if line == second_line:
                    f.write(res)
In short, you have two questions: how to write the modified text back to the file, and how to record the number of changed words.
Your code:
with open(Path + i,"w") as f:
    for line in file:
        if line == second_line:
            f.write(res)
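Opening the file with "w" truncates it and does not enable reading. On top of that, `for line in file` will not work: `f` is defined, but `file` is used instead, and `file` is the earlier read handle, which is already exhausted (or closed) at that point. To fix this, do the following instead: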
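with open(Path + i, "r+", encoding="utf-8") as file:
    lines = file.read().splitlines()  # splitlines() removes the \n characters
    lines[1] = res.rstrip("\n")  # the modified second line, not the original
    file.seek(0)  # reading moved the position to the end of the file
    file.write("\n".join(lines) + "\n")  # put the newlines back
    file.truncate()  # drop leftover text if the new content is shorter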
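The read-then-overwrite pattern with "r+" can be tried safely on a temporary file before touching the real dataset. This is a minimal sketch, not tested against your data: `replace_second_line` and the demo file name are hypothetical names introduced here for illustration.

```python
import os
import tempfile


def replace_second_line(path, new_text, encoding="utf-8"):
    """Overwrite line 2 of the file at `path` (hypothetical helper)."""
    with open(path, "r+", encoding=encoding) as fh:
        lines = fh.read().splitlines()
        lines[1] = new_text.rstrip("\n")
        fh.seek(0)      # reading left the position at the end of the file
        fh.write("\n".join(lines) + "\n")
        fh.truncate()   # remove leftovers if the new content is shorter


# Try it on a throwaway file rather than on the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("first line\nsecond line that is fairly long\n")
    replace_second_line(demo_path, "short")
    with open(demo_path, encoding="utf-8") as fh:
        demo_result = fh.read()

print(repr(demo_result))
```

Without `seek(0)` the new text would be appended after the old content, and without `truncate()` the tail of a longer old second line would survive at the end of the file.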
However, you want to add more lines to the file, so I suggest you structure the logic differently. Add a variable `changed_words_count` and increment it when `old_word != new_word`:
for i in filelist:
    filepath = Path + i
    with open(filepath, "r", encoding="utf-8") as file:
        data = file.readlines()
    first_line = data[0].rstrip("\n")
    # Strip the trailing newline so it cannot be lost along with the last word
    second_line_array = data[1].rstrip("\n").split(" ")

    changed_words_count = 0
    for j in range(nb_words_to_replace):
        replacement_position = randrange(len(second_line_array))
        old_word = second_line_array[replacement_position]
        new_word = new_words[randrange(len(new_words))]
        # A word replaced does not mean the word has changed.
        # It could be replacing itself.
        # Check if the replacing word is different
        if old_word != new_word:
            changed_words_count += 1
        second_line_array[replacement_position] = new_word

    # The lines that will replace the file contents
    new_lines = [
        first_line,
        " ".join(second_line_array),
        str(changed_words_count),
    ]
    print(f"Result: {new_lines[1]}")

    # writelines() does not add newlines, so join the lines explicitly
    with open(filepath, "w", encoding="utf-8") as file:
        file.write("\n".join(new_lines) + "\n")
Note: Code not tested.
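One caveat with a per-replacement counter: if the same position is picked twice, it is counted twice even though only one word in the line ends up different. If the third line should instead hold the number of positions that differ from the original, one option is to compare the final word list with the original one. The sketch below assumes that definition of "changed"; `replace_random_words` and the sample sentence are hypothetical, and the replacement words are stripped because several entries in `new_words` carry a trailing space.

```python
import random


def replace_random_words(line, new_words, n_replacements, rng=random):
    """Replace random words; return (new_line, changed_count). Hypothetical helper."""
    original = line.split()
    if not original:
        return line.strip(), 0
    words = list(original)
    for _ in range(n_replacements):
        position = rng.randrange(len(words))
        # strip() removes the trailing spaces present in some list entries
        words[position] = rng.choice(new_words).strip()
    # Count positions whose final word differs from the original word.
    changed = sum(1 for old, new in zip(original, words) if old != new)
    return " ".join(words), changed


rng = random.Random(0)  # fixed seed so the demo is reproducible
new_line, changed = replace_random_words(
    "le chat dort sur le canapé", ["thème ", "atelier ", "nous"], 3, rng
)
print(new_line, changed)
```

Passing a seeded `random.Random` instance makes the generated dataset reproducible, which can help when debugging the training pipeline later.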