Search code examples
python-3.xpandascsvxls

Merge data from different file types in a folder into a pandas DataFrame


I have a folder that has some files as .csv and some as .xls. I want to get all the data from them into one single dataframe. The good part is that all the files have the same data fields.

I am using the following code:

import os
import glob
import pandas as pd
os.chdir("/content/")

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

new_extend = 'xls'
all_files = [i for i in glob.glob('*.{}'.format(extension))]
all_filenames.append(all_files)

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')

But I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-7-747e8d68cec8> in <module>()
      1 #combine all files in the list
----> 2 combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
      3 #export to csv
      4 combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')

3 frames
/usr/local/lib/python3.7/dist-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    241     if not is_file_like(filepath_or_buffer):
    242         msg = f"Invalid file path or buffer object type: {type(filepath_or_buffer)}"
--> 243         raise ValueError(msg)
    244 
    245     return filepath_or_buffer, None, compression, False

ValueError: Invalid file path or buffer object type: <class 'list'>

Can someone please tell me what I am doing wrong?


Solution

  • I could not test your method because I had no access to your files, but I've tested it with some sample files you can download here in case you need.

    Try to run the code below, changing only "mypath" to your folder where files are located. In the same folder the script will generate the final dataframe ("1.Final_DF.xlsx"):

    mypath = r"F:\yourFolder"
    
    import pandas as pd
    
    import glob, os
    os.chdir(mypath)
    
    first_df = 0
    ext_csv = ".csv"
    ext_xls = ".xls"
    
    target_files = []
    for file in glob.glob("*xls"):
        target_files.append(file)
    for file in glob.glob("*csv"):
        target_files.append(file)
    
    for file in target_files:
        file_path = mypath + "\\" + file
        if ext_xls in file:
            df_raw = pd.read_excel(file_path)
        else:
            df_raw = pd.read_csv(file_path)
        first_df += 1
        if first_df == 1:
            df = df_raw.copy()
        else:
            df = df.append(df_raw, ignore_index = True)
    
    df.to_excel (r'1.Final_DF.xlsx', index = False)
    

    Let me know ;-)