Tags: python, pandas, dataframe, for-loop, export-to-csv

How do you iterate through the files in a directory and, if a keyword is in them, write to a different data frame?


I am currently trying to iterate through all of the files in a directory and then write 'Yes' or 'No' to columns in a new data frame if certain strings appear in the files.

This works the way I would expect: it prints 'Yes' or 'No' to the terminal based on whether any of the words_in_file are present.

import pandas as pd
import numpy as np
from Byron import copy_to_processor_directory
from pip import qualify_file_name, FileCompare, normalize_file_extension
from pep.settings import WORKSPACE_ROOT, ACCOUNT_HOME
import sys
import os


file_results = pd.DataFrame()
file_results['test_case_found'] = ''
words_in_file = ['remote_directory', 'file_path']

def main():
    for subdir, dirs, files in os.walk(ACCOUNT_HOME):
        for file in files:
            directory_files = open(os.path.join(subdir, file), 'r')
            directory_file_code = directory_files.read()
            for key_word in words_in_file:
                if key_word in directory_file_code:
                    print('yes')
                else:
                    print('No')



file_results.to_csv('test.csv', index=False)

if __name__ == '__main__':
    main()

However, I expect the code below to then proceed to write 'Yes' or 'No' to each row of my file_results data frame, but it does not.

import pandas as pd
import numpy as np
from Byron import copy_to_processor_directory
from pip import qualify_file_name, FileCompare, normalize_file_extension
from pep.settings import WORKSPACE_ROOT, ACCOUNT_HOME
import sys
import os


file_results = pd.DataFrame()
file_results['test_case_found'] = ''
words_in_file = ['remote_directory', 'file_path']

def main():
    for subdir, dirs, files in os.walk(ACCOUNT_HOME):
        for file in files:
            directory_files = open(os.path.join(subdir, file), 'r')
            directory_file_code = directory_files.read()
            for key_word in words_in_file:
                if key_word in directory_file_code:
                    print('yes')
                    file_results['test_case_found'] = 'Yes'
                else:
                    print('No')
                    file_results['test_cause_found'] = 'No'



file_results.to_csv('test.csv', index=False)

if __name__ == '__main__':
    main()

I have found lots of examples for writing to the same data frame that you are iterating through, but I am iterating through files that I read and trying to write the results to a new data frame rather than just to a file. Please help!


Solution

  • What's wrong with your code: for a data frame df with n rows, df['col'] = '' creates the column col if it doesn't exist and sets all n entries of that column to '' (the same holds for any other string). Because you start from an empty data frame, file_results['test_case_found'] = '' creates a column test_case_found with zero entries, so you end up with an empty column in an empty data frame. Every later assignment in the loop repeats the same step: it sets all zero entries of an empty column to a string, which changes nothing. (The else branch also misspells the column name as 'test_cause_found', which would create a second, equally empty column.)

    Also, you're saving to csv before the main function is called, so even if your function were correct, you'd still be saving an empty dataframe.
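
The broadcasting behaviour described above can be checked in isolation. This is a minimal sketch, not the asker's script: the names empty and filled are made up for illustration, and the only assumption is that pandas is installed.

```python
import pandas as pd

# Scalar assignment broadcasts over the existing rows; an empty frame has none.
empty = pd.DataFrame()
empty['test_case_found'] = ''
empty['test_case_found'] = 'Yes'
rows_after_scalar = len(empty)

# Assigning a list to an empty frame creates one row per list element.
filled = pd.DataFrame()
filled['test_case_found'] = ['Yes', 'No', 'Yes']
rows_after_list = len(filled)

print(rows_after_scalar, rows_after_list)
```

The scalar version leaves the frame with a column but no rows, while the list version produces three rows, which is why collecting the results in a list first makes the difference.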

    What you should do: build a list that you update inside the loop, then create the column from that list. The column then has the same length as the list and holds the same data:

    file_results = pd.DataFrame()
    words_in_file = ['remote_directory', 'file_path']
    def main():
        results = []
        for subdir, dirs, files in os.walk(ACCOUNT_HOME):
            for file in files:
                # The with block closes each file once it has been read.
                with open(os.path.join(subdir, file), 'r') as directory_files:
                    directory_file_code = directory_files.read()
                # One entry is appended per (file, keyword) pair.
                for key_word in words_in_file:
                    if key_word in directory_file_code:
                        print('yes')
                        results.append('Yes')
                    else:
                        print('No')
                        results.append('No')

        # Assigning the whole list creates one row per entry.
        file_results['test_case_found'] = results


    Alternatively, you could create the data frame directly from the list. In that case you don't need the file_results = pd.DataFrame() line, and you replace the file_results['test_case_found'] = results line with: file_results = pd.DataFrame({'test_case_found': results})
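
Putting the two points together (collect results first, save only afterwards), one possible end-to-end variation looks like the sketch below. It departs from the answer in a few labelled ways: the helper scan_directory is a hypothetical name, it records one row per file with a Yes/No column per keyword instead of one flat column, and it runs against a throwaway temporary directory rather than ACCOUNT_HOME so that it can be executed anywhere.

```python
import os
import tempfile

import pandas as pd


def scan_directory(root, keywords):
    """Return one row per file, with a Yes/No column for each keyword."""
    records = []
    for subdir, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(subdir, name)
            # Assumption: the files are text; undecodable bytes are skipped.
            with open(path, 'r', encoding='utf-8', errors='ignore') as handle:
                text = handle.read()
            row = {'file': path}
            for key_word in keywords:
                row[key_word] = 'Yes' if key_word in text else 'No'
            records.append(row)
    # Build the frame once, from the complete list of records.
    return pd.DataFrame(records, columns=['file', *keywords])


# Demo on a temporary directory (stand-in for ACCOUNT_HOME).
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'a.py'), 'w', encoding='utf-8') as handle:
        handle.write("remote_directory = '/tmp'\n")
    with open(os.path.join(root, 'b.py'), 'w', encoding='utf-8') as handle:
        handle.write("print('nothing here')\n")

    file_results = scan_directory(root, ['remote_directory', 'file_path'])

    # Write the CSV only after the frame has been filled.
    csv_path = os.path.join(root, 'test.csv')
    file_results.to_csv(csv_path, index=False)
    csv_written = os.path.exists(csv_path)
```

Keeping the file path in its own column makes each Yes/No traceable to a file, which the single flat column does not allow. If the original layout is preferred, the same function could append plain 'Yes'/'No' strings to a list instead of dictionaries.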