
Data generation Python


I'm trying to generate a dataset based on an existing one. I was able to implement a method that randomly changes the contents of the files, but I can't write the result back to a file. I also need to write the number of changed words to the file, since I want to use this dataset to train a neural network. Could you help me?

Input: files with 2 lines of text in each.

Output: files with 3 lines (maybe): the first line is unchanged, the second is changed by the method, and the third shows the number of words changed. (If a different layout is better for deep learning tasks, I would be glad to get advice, since I'm a beginner.)

from random import randrange
import os

Path = "D:\\corrected data\\"
filelist = os.listdir(Path)

if __name__ == "__main__":
    new_words = ['consultable', 'partie ', 'celle ', 'également ', 'forte ', 'statistiques ', 'langue ', 
'cadeaux', 'publications ', 'notre', 'nous', 'pour', 'suivr', 'les', 'vos', 'visitez ', 'thème ', 'thème  ', 'thème ', 'produits', 'coulisses ', 'un ', 'atelier ', 'concevoir  ', 'personnalisés  ', 'consultable', 'découvrir ', 'fournit ', 'trace ', 'dire ', 'tableau', 'décrire', 'grande ', 'feuille ', 'noter ', 'correspondant', 'propre',]
    nb_words_to_replace = randrange(10)

    #with open("1.txt") as file:
    for i in filelist:
       # if i.endswith(".txt"):  
            with open(Path + i,"r",encoding="utf-8") as file:
               # for line in file:
                    data = file.readlines()
                    first_line = data[0]
                    second_line = data[1]
                    print(f"Original: {second_line}")
                   # print(f"FIle: {file}")
                    second_line_array = second_line.split(" ")
                    for j in range(nb_words_to_replace):
                        replacement_position = randrange(len(second_line_array))

                        old_word = second_line_array[replacement_position]
                        new_word = new_words[randrange(len(new_words))]
                        print(f"Position {replacement_position} : {old_word} -> {new_word}")

                        second_line_array[replacement_position] = new_word

                    res = " ".join(second_line_array)
                    print(f"Result: {res}")
            with open(Path + i,"w") as f:
                       for line in file:
                          if line == second_line:
                                f.write(res)

Solution

  • In short, you have two questions:

    • How to properly replace line number 2 (and 3) of the file.
    • How to keep track of number of words changed.

    How to properly replace line number 2 (and 3) of the file.

    Your code:

    with open(Path + i,"w") as f:
       for line in file:
          if line == second_line:
             f.write(res)
    

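    The file is opened in write mode ("w"), so it cannot be read from, and opening it this way also empties it. for line in file will not work either: f is defined, but file is used instead, and file was already closed when the first with block ended. To fix this, do the following instead: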

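    with open(Path + i, "r+", encoding="utf-8") as file:
       lines = file.read().splitlines()    # splitlines() removes the \n characters
       lines[1] = res.rstrip("\n")         # res is the modified second line
       file.seek(0)                        # go back to the start before writing
       file.truncate()                     # drop the old contents
       file.write("\n".join(lines) + "\n")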
    

    However, you want to add more lines to it. I suggest you structure the logic differently.


    How to keep track of number of words changed.

    Add a variable changed_words_count and increment it whenever old_word != new_word.
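
    An alternative way to get the count, shown here only as a minimal sketch: keep a copy of the word list from before any replacement and compare it with the modified list position by position. The helper name count_changed_words is hypothetical and not part of the original code. Because randrange can pick the same position more than once, incrementing a counter inside the loop may count one position several times; comparing the two lists counts each position at most once.

```python
def count_changed_words(original_words, modified_words):
    """Count positions where the word differs between the two lists."""
    return sum(
        1
        for old, new in zip(original_words, modified_words)
        if old != new
    )


# Usage: copy the list before the replacement loop, compare afterwards.
before = ["ceci", "est", "une", "phrase"]
after = ["ceci", "langue", "une", "notre"]
changed = count_changed_words(before, after)
```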


    Resulting code:

    for i in filelist:
        filepath = Path + i
    
        # The lines that will be replacing the file
        new_lines = [""] * 3
        
        with open(filepath, "r", encoding="utf-8") as file:
            data = file.readlines()
            first_line = data[0].rstrip("\n")
            second_line = data[1].rstrip("\n")
            
            # split() with no argument also drops stray whitespace,
            # so the last word does not carry a trailing newline
            second_line_array = second_line.split()
    
            changed_words_count = 0
            for j in range(nb_words_to_replace):
                replacement_position = randrange(len(second_line_array))
    
                old_word = second_line_array[replacement_position]
                new_word = new_words[randrange(len(new_words))]
    
                # A word replaced does not mean the word has changed.
                # It could be replacing itself.
                # Check if the replacing word is different
                if old_word != new_word:
                    changed_words_count += 1
                
                second_line_array[replacement_position] = new_word
            
            # Add the lines to the new file lines.
            # writelines() does not add line breaks, so add them here.
            new_lines[0] = first_line + "\n"
            new_lines[1] = " ".join(second_line_array) + "\n"
            new_lines[2] = str(changed_words_count) + "\n"
            
            print(f"Result: {new_lines[1]}")
        
        with open(filepath, "w", encoding="utf-8") as file:
            file.writelines(new_lines)
    

    Note: Code not tested.
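
    For reference, the same idea can be sketched as two small functions. This is only one possible arrangement, under some assumptions: the names replace_words and rewrite_file are hypothetical, every input file is assumed to have at least two lines, and positions are drawn with random.sample so that no word is replaced twice (a change from the randrange loop above). The replacement words are stripped because several entries in new_words carry trailing spaces.

```python
import random
from pathlib import Path


def replace_words(line, vocabulary, max_replacements, rng):
    """Return (new_line, changed_count) for one line of text."""
    words = line.split()
    # Pick distinct positions so no word is replaced twice.
    n = min(max_replacements, len(words))
    positions = rng.sample(range(len(words)), n)
    changed = 0
    for pos in positions:
        new_word = rng.choice(vocabulary).strip()
        # A replacement only counts if the word actually differs.
        if new_word != words[pos]:
            changed += 1
        words[pos] = new_word
    return " ".join(words), changed


def rewrite_file(path, vocabulary, max_replacements, rng):
    """Rewrite a two-line file as three lines: original, modified, count."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines()
    first_line, second_line = lines[0], lines[1]
    new_second, changed = replace_words(
        second_line, vocabulary, max_replacements, rng
    )
    # Write all three lines with explicit line breaks.
    path.write_text(
        f"{first_line}\n{new_second}\n{changed}\n", encoding="utf-8"
    )
    return changed
```

    Passing a random.Random instance as rng (for example random.Random(0)) makes a run reproducible, which can help when regenerating the same dataset later.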