Search code examples
pythondata-structuresjupyter-notebookdata-extractiondata-ingestion

Extracting data from Multiple Excel files with multiple tabs and multiple columns using Python


I'm trying to create a data ingestion routine to load data from multiple excel files with multiple tabs and columns in a data structure using python. The structuring of the tabs in each of the excel files is the same. Can someone please help me with my code? Please let me know what can be changed here.

folder = r'specified_path'

#Changing the directory to the database directory
os.chdir(folder) 

#Getting the list of files from the assigned path
files = os.listdir(folder) 

#Joining the list of files to the assigned path
for archivedlist in files:
    local_path = os.path.join(folder, archivedlist)
    print("Joined Path: ", local_path)

#Reading the data from the files in the dictionary data structure
main_dict = {}
def readdataframe(files):
    df_dict = {}
    for element in files:
        df_dict[element] = pd.read_excel(element, sheet_name = None)
    print(df_dict[element].keys)
return df_dict

print(readdataframe(files))

Solution

  • Let's assume you have two files in a directory called excel_test:

    1.xlsx 
    Sheet1      Sheet2
    col1 col2   col1 col2
    1     2     3     4
    
    2.xlsx 
    Sheet1      Sheet2
    col1 col2   col1 col2
    5     6     7     8
    

    You can store your extracted data in multiple ways, let's see some methods:


    1) A single dictionary

    A dictionary where all the keys are strings composed by the "name of the file, underscore, sheet name" and the values are pd.DataFrames

    import pandas as pd
    import os
    
    files_dir = "excel_test/"
    files = os.listdir(files_dir) 
    
    # here will be stored all the content from your files and sheets
    sheets_content = {}
    
    # reading all files inside the folder
    for file in files:
      
      # reading the content of a xlsx file
      data = pd.ExcelFile(files_dir+file)
      
      # iterating through all sheets
      for sheet in data.sheet_names:
        # saving the content of the sheet for that file (-5 deletes the .xlsx part from the name of the file and makes everything more readable)
        sheets_content[file[:-5]+"_"+sheet] = data.parse(sheet)
    
    print(sheets_content)
    

    Output:

    {'1_Sheet1':    
            col1  col2
        0     1     2, 
    '1_Sheet2':    
            col1  col2
        0     3     4, 
    '2_Sheet1':    
            col1  col2
        0     5     6, 
    '2_Sheet2':   
            col1  col2
        0     7     8
    }
    

    2) A dictionary of dictionaries

    Store all xlsx files in a dictionary with as keys the file names and as value another dictionary. The inner dictionary has keys as sheets names and values are pd.DataFrames:

    import pandas as pd
    import os
    
    files_dir = "excel_test/"
    files = os.listdir(files_dir) 
    
    sheets_content = {}
    
    for file in files:
    
      data = pd.ExcelFile(files_dir+file)
      
      file_data = {}
      for sheet in data.sheet_names:
        file_data[sheet] = data.parse(sheet)
    
      sheets_content[file[:-5]] = file_data
    

    Output:

    {'1': 
        {'Sheet1':    
               col1  col2
           0     1     2, 
         'Sheet2':    
               col1  col2
           0     3     4}, 
     '2': 
        {'Sheet1':    
               col1  col2
            0     5     6, 
         'Sheet2':    
               col1  col2
            0     7     8}
    }
    

    3) A list of dictionaries

    Store all xlsx files in an array where each element is a dictionary with as keys the sheets names and values pd.DataFrames:

    import pandas as pd
    import os
    
    files_dir = "excel_test/"
    files = os.listdir(files_dir) 
    
    sheets_content = []
    
    for file in files:
    
      data = pd.ExcelFile(files_dir+file)
      
      file_data = {}
      for sheet in data.sheet_names:
        file_data[sheet] = data.parse(sheet)
      sheets_content.append(file_data)
    

    Output:

    [
     {'Sheet1':    
          col1  col2
       0     1     2, 
      'Sheet2':    
          col1  col2
       0     3     4}, 
     {'Sheet1':    
          col1  col2
       0     5     6, 
      'Sheet2':    
          col1  col2
       0     7     8}
    ]