Search code examples
pythonpandassas-macro

How to run the same code with input different files and output different files? Same as SAS Macro


Hi I am looking for a step to iterate the following:

  1. Data_Dec2019.xlsx
  2. Data_Mar2020.xlsx
  3. Data_Jun2020.xlsx
  4. ...
  5. ...
  6. Base_Data.xlsx

The code is as below:

#Importing package
import pandas as pd

df_base = pd.read_excel(r'...\Base_Data.xlsx')

df_data = pd.read_excel(r'...\Data_Jun2020.xlsx')

#Data Cleaning for Base
df_base = df_base.sort_values(by='CustomerNumber')

df_base = df_base.drop_duplicates('CustomerNumber')


#Data Cleaning for Data
df_data = df_data.groupby (['CustomerNumber'])['Amount'].sum().reset_index()

#Map both data
df_mapped = pd.merge (df_base, df_data, on = "CustomerNumber", how = "inner")

#Export Data
df_mapped.to_excel(r'...\Merged_June2020.xlsx')

I have more than 10 data with different snapshots, would like to iterate the process. Tried to look through the community post but mostly was a complex issues. Note that the data processing code is longer and more complex, but for illustration purpose, I shared the simplified version just to understand if there is a easy way to do so.

Appreciate the help.


Solution

  • Create a function to process file and iterate over filenames with pathlib. Adapt the code below:

    import pathlib
    
    data_dir = 'file/path/to/data'
    
    # load base data
    df_base = pd.read_excel(r'...\Base_Data.xlsx') \
                .sort_values(by='CustomerNumber') \
                .drop_duplicates('CustomerNumber')
    
    
    # load, process and export a file
    def process_data(file):
        df_data = pd.read_excel(file) \
                    .groupby (['CustomerNumber'])['Amount'] \
                    .sum().reset_index()
    
        df_base.merge(df_data, on="CustomerNumber") \
               .to_excel(file.parent / f'Merged_{file.name}')
    
    
    for file in data_dir.glob('Data_*.xlsx'):
       process_data(file)