Search code examples
pythonpandascsvdateglob

Add a month column to each excel file and then merge all files into a .csv


im new using python and for a work purpose im here to ask your help.

I have on a same folder 12 excel files month by month that contains columns like: Product_Name, Quantity and Total_Value

So, what i would like to do but I have no idea how to do are:

  1. Add a month column on each of those files that contain the same date from the file name
  2. Merge those excel files into a unique file

For example:

January-21.xls:

Product_Name (type:string) Quantity (type:float) Total_Value (type:float) Month (type:Date)
Product A 10 250 "File Name" (January-21)
Product B 20 500 "File Name" (January-21)
Product C 15 400 "File Name" (January-21)

February-21.xls:

Product_Name (type:string) Quantity (type:float) Total_Value (type:float) Month (type:Date)
Product A 40 800 "File Name" (February-21)
Product B 25 700 "File Name" (February-21)
Product C 30 500 "File Name" (February-21)

After merge:

Product_Name (type:string) Quantity (type:float) Total_Value (type:float) Month (type:Date)
Product A 10 250 "File Name" (January-21)
Product B 20 500 "File Name" (January-21)
Product C 15 400 "File Name" (January-21)
Product A 40 800 "File Name" (February-21)
Product B 25 700 "File Name" (February-21)
Product C 30 500 "File Name" (February-21)

Is it possible? Sorry for my bad English, I'm not a native speaker.

I really appreciate your help!


Edit.1

This is how i merge, create a csv file and convert to a dataframe using pandas:


import pandas as pd
import os

path = "/content/drive/MyDrive/Colab_Notebooks/sq_datas"
files = [file for file in os.listdir(path) if not file.startswith('.')] # Ignore hidden files

all_months_data = pd.DataFrame()

for file in files:
    current_data = pd.read_excel(path+"/"+file)
    all_months_data = pd.concat([all_months_data, current_data])
    
all_months_data.to_csv("/content/drive/MyDrive/Colab_Notebooks/sq_datas/all_months.csv", index=False)

So, the main problem to me is create a loop for to add the month column before merge all those files in one.


Solution

  • At a basic level, you first need to read your Excel files, such as with pandas.read_excel:

    import pandas as pd
    
    jan21_df = pd.read_excel('January-21.xls')
    feb21_df = pd.read_excel('February-21.xls')
    

    You wrote type:Date for the Month column. To add a date column to each dataframe:

    jan21_df['Month'] = pd.to_datetime('2021-01-01')
    feb21_df['Month'] = pd.to_datetime('2021-02-01')
    

    But if you wanted the file name as string:

    jan21_df['Month'] = "File Name (January-21)"
    feb21_df['Month'] = "File Name (February-21)"
    

    Then to combine the two dataframes:

    combined = pd.concat([jan21_df, feb21_df])
    

    This is a proof of concept. There are ways to automate this further based on the requirements.

    EDIT: based on the edit in the OP, minor addition to the loop:

    for file in files:
        current_data = pd.read_excel(path+"/"+file)
        current_data['Month'] = file
        all_months_data = pd.concat([all_months_data, current_data])