python pandas csv concatenation glob

Concatenating multiple dataframes. Issue with datapaths


I want to concatenate several CSV files that I saved in the directory ./Errormeasure. To do so, I used the following code from an answer in another thread: https://stackoverflow.com/a/51118604/9109556

from os import listdir
import pandas as pd

filepaths = [f for f in listdir('./Errormeasure') if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
print(df)

However, this code only works when the CSV files I want to concatenate are both in the ./Errormeasure directory and in the directory below, ./venv. This is obviously not convenient. When the CSV files are only in ./Errormeasure, I receive the following error:

FileNotFoundError: [Errno 2] File b'errormeasure_871687110001543570.csv' does not exist: b'errormeasure_871687110001543570.csv'

Can you give me some tips to tackle this problem? I am using PyCharm. Thanks in advance!


Solution

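  • os.listdir() returns only bare file names, without the folder that contains them. pandas.read_csv() needs a path it can resolve, either relative to the working directory (typically where the pandas script resides) or absolute, so a bare name is found only when a file with that name also exists in the working directory.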

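    Instead, consider the recursive option of the built-in glob module (available only in Python 3.5+), which returns the paths, prefixed with the search folder, of all CSV files at the top level and in subfolders. Here dirpath stands for the folder to search, e.g. './Errormeasure'.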

    import glob
    
    for f in glob.glob(dirpath + "/**/*.csv", recursive=True):
        print(f)
    

    From there, build the data frames in a list comprehension (bypassing map; see "List comprehension vs map") and concatenate them with pd.concat:

    import pandas as pd

    df_files = [pd.read_csv(f) for f in glob.glob(dirpath + "/**/*.csv", recursive=True)]
    df = pd.concat(df_files)
    print(df)
    

    For Python < 3.5, consider combining os.walk() and os.listdir() to retrieve the paths of the CSV files:

    import os
    import pandas as pd
    
    # COMBINE CSVs IN CURR FOLDER + SUB FOLDERS
    fpaths = [os.path.join(dirpath, f) 
                for f in os.listdir(dirpath) if f.endswith('.csv')] + \
             [os.path.join(fdir, fld, f) 
                for fdir, flds, ffile in os.walk(dirpath) 
                for fld in flds  
                for f in os.listdir(os.path.join(fdir, fld)) if f.endswith('.csv')]
    
    df = pd.concat([pd.read_csv(f) for f in fpaths])
    print(df)
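
If the files sit only at the top level of ./Errormeasure, the missing-folder problem described in the first point can probably also be fixed without changing the approach from the question: joining the folder onto each name returned by os.listdir() gives read_csv a path it can resolve. A minimal sketch under that assumption; the helper name csv_paths is made up for illustration and does not come from the linked answer:

```python
import os


def csv_paths(dirpath):
    """Return the sorted paths of the .csv files directly inside dirpath.

    os.listdir() yields bare names such as 'a.csv'. Joining each name with
    dirpath produces a path that resolves from the working directory.
    Subfolders are not searched.
    """
    return sorted(
        os.path.join(dirpath, name)
        for name in os.listdir(dirpath)
        if name.endswith(".csv")
    )
```

With this list, the line from the question becomes df = pd.concat(map(pd.read_csv, csv_paths('./Errormeasure'))). Passing ignore_index=True to pd.concat should additionally give the combined frame one continuous index instead of repeating each file's own row numbers.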