python data-structures jupyter-notebook data-extraction data-ingestion

Extracting data from Multiple Excel files with multiple tabs and multiple columns using Python

I'm trying to create a data ingestion routine to load data from multiple excel files with multiple tabs and columns in a data structure using python. The structuring of the tabs in each of the excel files is the same. Can someone please help me with my code? Please let me know what can be changed here.

folder = r'specified_path'

#Changing the directory to the database directory
os.chdir(folder) 

#Getting the list of files from the assigned path
files = os.listdir(folder) 

#Joining the list of files to the assigned path
for archivedlist in files:
    local_path = os.path.join(folder, archivedlist)
    print("Joined Path: ", local_path)

#Reading the data from the files in the dictionary data structure
main_dict = {}
def readdataframe(files):
    df_dict = {}
    for element in files:
        df_dict[element] = pd.read_excel(element, sheet_name = None)
    print(df_dict[element].keys)
return df_dict

print(readdataframe(files))

Solution

Let's assume you have two files in a directory called excel_test:

1.xlsx 
Sheet1      Sheet2
col1 col2   col1 col2
1     2     3     4

2.xlsx 
Sheet1      Sheet2
col1 col2   col1 col2
5     6     7     8

You can store your extracted data in multiple ways, let's see some methods:

1) A single dictionary

A dictionary where all the keys are strings composed by the "name of the file, underscore, sheet name" and the values are pd.DataFrames

import pandas as pd
import os

files_dir = "excel_test/"
files = os.listdir(files_dir) 

# here will be stored all the content from your files and sheets
sheets_content = {}

# reading all files inside the folder
for file in files:
  
  # reading the content of a xlsx file
  data = pd.ExcelFile(files_dir+file)
  
  # iterating through all sheets
  for sheet in data.sheet_names:
    # saving the content of the sheet for that file (-5 deletes the .xlsx part from the name of the file and makes everything more readable)
    sheets_content[file[:-5]+"_"+sheet] = data.parse(sheet)

print(sheets_content)

Output:

{'1_Sheet1':    
        col1  col2
    0     1     2, 
'1_Sheet2':    
        col1  col2
    0     3     4, 
'2_Sheet1':    
        col1  col2
    0     5     6, 
'2_Sheet2':   
        col1  col2
    0     7     8
}

2) A dictionary of dictionaries

Store all xlsx files in a dictionary with as keys the file names and as value another dictionary. The inner dictionary has keys as sheets names and values are pd.DataFrames:

import pandas as pd
import os

files_dir = "excel_test/"
files = os.listdir(files_dir) 

sheets_content = {}

for file in files:

  data = pd.ExcelFile(files_dir+file)
  
  file_data = {}
  for sheet in data.sheet_names:
    file_data[sheet] = data.parse(sheet)

  sheets_content[file[:-5]] = file_data

Output:

{'1': 
    {'Sheet1':    
           col1  col2
       0     1     2, 
     'Sheet2':    
           col1  col2
       0     3     4}, 
 '2': 
    {'Sheet1':    
           col1  col2
        0     5     6, 
     'Sheet2':    
           col1  col2
        0     7     8}
}

3) A list of dictionaries

Store all xlsx files in an array where each element is a dictionary with as keys the sheets names and values pd.DataFrames:

import pandas as pd
import os

files_dir = "excel_test/"
files = os.listdir(files_dir) 

sheets_content = []

for file in files:

  data = pd.ExcelFile(files_dir+file)
  
  file_data = {}
  for sheet in data.sheet_names:
    file_data[sheet] = data.parse(sheet)
  sheets_content.append(file_data)

Output:

[
 {'Sheet1':    
      col1  col2
   0     1     2, 
  'Sheet2':    
      col1  col2
   0     3     4}, 
 {'Sheet1':    
      col1  col2
   0     5     6, 
  'Sheet2':    
      col1  col2
   0     7     8}
]