I have a folder with 6 files, 4 are excel files that I would like to bring into pandas and 2 are just other files. I want to be able to use pathlib
to work with the folder to automatically ingest the excel files I want into individual pandas dataframes. I would also like to be able to name each new dataframe with the name of the excel file (without the file extension)
for example.
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, i'd like to be able to do something like
for i in files:
if ' ' not in str(i.name):
str(i.name.strip('.xlsx'))) = pd.read_excel(i)
read the file name, if it doesn't contain any spaces, take the name, remove the file extension and use that as the variable name for a pandas dataframe built from the excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.
using pathlib
and re
we can exclude any files that match a certain pattern in our dictionary comprehension, that is any files with a space.
from pathlib import Path
import re
import pandas as pd
pth = (r'C:\Users\username\project\output')
files = Path(pth).glob('*.xlsx') # use `rglob` if you want to to trawl a directory.
dfs = {file.stem : pd.read_excel(file) for file in
files if not re.search('\s', file.stem)}
based on the above you'll get :
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where pandas.core.frame.DataFrame
is your target dataframe.
you can then call them by doing dfs['john']