python pandas dataframe numpy data-processing

Read and process data files from folder, add column based on filename and merge


I have .csv files in a folder for 10 different people. The files are named "data_m1", "data_m2", etc. I want to write a for loop that reads all the files, processes each one by applying certain functions and creating new columns as features, and then merges them into one file efficiently. Along the way, I want to read each filename, add a "name" column, and label the rows according to the filename: "m1", "m2", etc.

Let's say I want to apply this simple step, which creates a new column, to each of the 10 files in the folder:

    df['new_column1'] = df['value'].apply(lambda x: 1 if 0 < x <= 10 else 2 if 10 < x < 20 else np.nan)

Then, at the end, I want to combine all the files into one dataframe, with the "name" column labeling each row by its filename suffix: "m1", "m2", etc.


Solution

  • You can try the following

    import glob
    import os
    import pandas as pd
    
    
    # create an empty list to append DataFrames
    dfs = []
    # use glob to get a list of file names and iterate (sorted for a stable order)
    for file in sorted(glob.glob('/path/to/folder/data_m*.csv')):
        # read the file
        df = pd.read_csv(file)
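        # assign the name column from the file name alone, e.g. 'data_m1.csv' -> 'm1'
        # (basename drops the directories, so an underscore in the path cannot break the split)
        df['name'] = os.path.splitext(os.path.basename(file))[0].split('_', 1)[1]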
        # do more stuff here
        # append df to the empty list
        dfs.append(df)
    
    # concat all your frames together, renumbering the rows from 0
    final_df = pd.concat(dfs, ignore_index=True)
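
One way to fill in the "do more stuff here" step is sketched below. It assumes every file has a numeric `value` column, as in the question. The function names `add_features` and `load_folder` are made up for illustration, and `np.select` is swapped in for the `apply`/`lambda` from the question as a vectorized alternative that should give the same result (1 for 0 < value <= 10, 2 for 10 < value < 20, NaN otherwise).

```python
import glob
import os

import numpy as np
import pandas as pd


def add_features(df):
    """Return a copy of df with new_column1 derived from the 'value' column."""
    value = df['value']
    # Same rule as the lambda in the question, evaluated on the whole column at once.
    conditions = [(value > 0) & (value <= 10), (value > 10) & (value < 20)]
    out = df.copy()
    out['new_column1'] = np.select(conditions, [1, 2], default=np.nan)
    return out


def load_folder(folder):
    """Read every data_m*.csv in folder, process it, label it, and stack the results."""
    dfs = []
    for path in sorted(glob.glob(os.path.join(folder, 'data_m*.csv'))):
        df = add_features(pd.read_csv(path))
        # 'data_m1.csv' -> 'data_m1' -> 'm1'
        stem = os.path.splitext(os.path.basename(path))[0]
        df['name'] = stem.split('_', 1)[1]
        dfs.append(df)
    # pd.concat raises ValueError if no file matched the pattern.
    return pd.concat(dfs, ignore_index=True)
```

Calling `final_df = load_folder('/path/to/folder')` would then replace the loop above. Keeping the processing in its own function makes it easy to test on a single file before running it over the whole folder.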