Search code examples
pythonexcelpandasdataframepathlib

programtically ingesting xl files to pandas data frame by reading filename


I have a folder with 6 files, 4 are excel files that I would like to bring into pandas and 2 are just other files. I want to be able to use pathlib to work with the folder to automatically ingest the excel files I want into individual pandas dataframes. I would also like to be able to name each new dataframe with the name of the excel file (without the file extension)

for example.

import pandas as pd
import pathlib as pl

folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]

for i in files:
    print(i)

['C:\Users\username\project\output\john.xlsx',
    'C:\Users\username\project\output\paul.xlsx',
    'C:\Users\username\project\output\random other file not for df.xlsx',
    'C:\Users\username\project\output\george.xlsx',
    'C:\Users\username\project\output\requirements for project.txt',
    'C:\Users\username\project\output\ringo.xlsx' ] 

From here, i'd like to be able to do something like

for i in files:
    if ' ' not in str(i.name):
        str(i.name.strip('.xlsx'))) = pd.read_excel(i)

read the file name, if it doesn't contain any spaces, take the name, remove the file extension and use that as the variable name for a pandas dataframe built from the excel file.

If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.

Any help is appreciated.


Solution

  • using pathlib and re

    we can exclude any files that match a certain pattern in our dictionary comprehension, that is any files with a space.

    from pathlib import Path
    import re 
    import pandas as pd
    
    pth = (r'C:\Users\username\project\output')
    
    files = Path(pth).glob('*.xlsx') # use `rglob` if you want to to trawl a directory.
    
    
    dfs = {file.stem : pd.read_excel(file) for file in 
                       files if not re.search('\s', file.stem)}
    

    based on the above you'll get :

    {'john': pandas.core.frame.DataFrame,
     'paul': pandas.core.frame.DataFrame,
     'george': pandas.core.frame.DataFrame,
     'ringo': pandas.core.frame.DataFrame}
    

    where pandas.core.frame.DataFrame is your target dataframe.

    you can then call them by doing dfs['john']