
Open CSV files with partially unknown names from subdirectories and combine them into one big file


I have a bunch of files in different subfolders of a root folder. I want to open every file whose name contains 'NBack' and has the '.csv' extension, but does not contain the letter 'X'. Then I want to add two columns to each file and concatenate all of these files into one big file.

So far I have written the code below, but it runs for a very long time and seems to process the same files again and again (I am not sure about this). At the end I do not get a concatenated file, only a single file.

for root, folders, files in os.walk(path):

    for f in files:
        filteredResults = [f for f in files if not "X" in f] #exclude files with the letter 'X'

        for ff in filteredResults:
            dd = [ff for ff in filteredResults if ff.endswith('.csv')] #among remaining files, keep the .csv files

            for g in dd:
                r = [g for g in dd if 'NBack' in g] #among those, keep those containing 'NBack'
                a = pd.DataFrame()                  #empty dataset for the new big dataset
                for i in r:
                    o = [i for i in r if not '.pdf' in i] #exclude .pdf's (for some reason including only .csv didn't work well enough).
                    appended = [] #necessary to append files before concatenating them????
                    for ii in o: #for the final set of files
                        p = os.path.join(root, ii)
                        data = pd.read_csv(p)       #open .csv with specified characteristics in each subdirectory
                        split = ii.split("_")     #split file name to get additional information
                        data['Run']=split[3]    #add this information as a new column
                        data['IDcheck']=split[0]  #add this information as a new column

                        appended.append(data) #necessary to append? creates a list of files
                    a = pd.concat([data])  #should create one big file but the variable a just contains one file

I would be grateful for any comments or suggestions on what to try and where the error is.
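
For readers with the same symptom, here is a minimal sketch of what appears to go wrong in the last line of the code above. It uses two small made-up tables (`first` and `second` are placeholders, not the real files) instead of reading CSVs. `pd.concat([data])` receives a list holding only the most recently read table, so the result is a single file, whereas passing the list `appended` combines everything collected so far. The long runtime is probably a separate issue: each nested `for` loop re-filters the same list and repeats the inner loops once per file, so the same files are read many times.

```python
import pandas as pd

# Two tiny stand-in tables; in the question these come from pd.read_csv.
first = pd.DataFrame({"trial": [1, 2], "Run": "1"})
second = pd.DataFrame({"trial": [1, 2], "Run": "2"})

appended = []
for data in (first, second):
    appended.append(data)  # collect every table in a list

# What the question does: only the last table is passed to concat.
only_last = pd.concat([data])

# Passing the whole list combines all collected tables.
combined = pd.concat(appended, ignore_index=True)

print(len(only_last), len(combined))
```

Note that the list also has to be created once, before the loops start; in the question `appended = []` and `a = pd.DataFrame()` sit inside the loops and are reset on every iteration.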


Solution

  • This code works for me. I am sharing it in case someone has a similar question:

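    import os
    import pandas as pd

    os.chdir(r'C:\Users\...')
    rootdir = os.getcwd()
    frames = []  # collect one DataFrame per file, concatenate once at the end

    for root, _, files in os.walk(rootdir):
        for f in files:
            # Test every condition explicitly: `".csv" and "NBack" in path`
            # only checks for "NBack", because the string ".csv" is always truthy.
            # The checks use the file name so that folder names cannot interfere.
            if f.endswith(".csv") and "NBack" in f and "X" not in f:
                path = os.path.join(root, f)
                splitt = f.split('_')
                r = pd.read_csv(path)
                r['Run'] = splitt[2]       # third part of the file name
                r['IDcheck'] = splitt[0]   # first part of the file name
                frames.append(r)

    # pd.concat raises an error on an empty list, so fall back to an empty frame
    df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()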
    

    Thanks Yasir for the help!
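
The same idea can also be sketched with `pathlib`, which searches the subfolders and handles path separators itself. This is only an illustration under a few assumptions: the function name `collect_nback` and the parameters `run_index` and `id_index` are made up here, the defaults follow the indices used in the solution above, and the file name is split without its `.csv` extension, which differs slightly from the solution.

```python
from pathlib import Path

import pandas as pd


def collect_nback(rootdir, run_index=2, id_index=0):
    """Read every matching CSV below rootdir and return one combined DataFrame.

    Hypothetical helper: run_index and id_index say which underscore-separated
    part of the file name goes into the 'Run' and 'IDcheck' columns.
    """
    frames = []
    # rglob walks all subfolders; sorted() makes the row order reproducible.
    for path in sorted(Path(rootdir).rglob("*.csv")):
        name = path.name
        # Keep only file names that contain 'NBack' and no 'X'.
        if "NBack" not in name or "X" in name:
            continue
        # Assumption: split the name without '.csv' so the last part stays clean.
        parts = path.stem.split("_")
        data = pd.read_csv(path)
        data["Run"] = parts[run_index]
        data["IDcheck"] = parts[id_index]
        frames.append(data)
    # Concatenate once at the end; return an empty frame if nothing matched.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

To get the one big file, something like `collect_nback(rootdir).to_csv("combined.csv", index=False)` should do; `combined.csv` is just an example name. Choose an output name that does not itself match the filter (or write it outside the root folder), otherwise a second run would pick up the combined file as well.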