I have Excel files named like "name1 01.01.2018.xlsx", "name2 12.23.2019.xlsx", and so on. I want to join all files with matching dates (last 10 characters).
I can join all of them by doing:
import glob
import os
import pandas as pd
os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
df = files[0].drop(files[0].tail(0).index).append([files[i].drop(files[i].tail(0).index) for i in range(1,len(files))])
How can I join only the files whose last 10 characters match? I don't have a list of possible values for the last 10 characters in advance, so the grouping should update automatically.
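First, let's tidy up your code a bit. The line that joins the DataFrames works on older pandas, but it is hard to read, and DataFrame.append was removed in pandas 2.0, so the version below uses pd.concat instead. Note also that f.tail(0) is always empty, so the drop call removes nothing; skipfooter=1 already skips the last row of each sheet. The drop is kept here only to mirror your original:
import glob
import os
import pandas as pd
os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
# drop the tail of all files (a no-op with tail(0); use tail(n) to drop the last n rows)
files = [f.drop(f.tail(0).index) for f in files]
# join all files
df = pd.concat(files)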
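One thing to watch when stacking frames row-wise is the index: by default each frame keeps its own row labels, so the joined result can contain repeated labels. Here is a minimal sketch with two made-up frames (the column names and values are purely illustrative, not taken from your workbooks) that uses pd.concat with ignore_index=True to renumber the rows:

```python
import pandas as pd

# Two small, made-up frames standing in for two workbooks
a = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
b = pd.DataFrame({"id": [3], "value": [30]})

# Stack them row-wise; ignore_index=True renumbers the rows from 0
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

Leave ignore_index at its default if the original row labels matter to you.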
Then we need to keep track of the filename each DataFrame came from, so there is a way to tie each DataFrame back to its date. One catch: the last 10 characters of "name1 01.01.2018.xlsx" are ".2018.xlsx", because the extension is included, so the date is actually p[-15:-5]. Also, DataFrame.append no longer exists in pandas 2.0 and later, so the join uses pd.concat:
import glob
import os
import pandas as pd
os.chdir('filepath')
# store the date: the 10 characters just before the ".xlsx" extension
files = [(p[-15:-5], pd.read_excel(p, skipfooter=1)) for p in glob.glob("*.xlsx")]
# drop the tail of all files (f.tail(0) is empty, so this currently removes nothing)
files = [(date, f.drop(f.tail(0).index)) for date, f in files]
# group files by date
files = {date: [f for d, f in files if d == date] for date in set(d for d, f in files)}
# join all files with the same date
for date, frames in files.items():
    files[date] = pd.concat(frames)
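If you would rather check the grouping before reading any workbooks, the same idea can be sketched on the filenames alone. This is only a sketch: group_paths_by_date is a hypothetical helper, and the example filenames are made up. It strips the extension with os.path.splitext, so the key is the date itself, not the last characters of the full filename:

```python
import os
from collections import defaultdict

def group_paths_by_date(paths, date_len=10):
    """Map each date suffix to the list of paths whose stem ends with it."""
    groups = defaultdict(list)
    for path in paths:
        # Drop the directory and the extension, keeping e.g. "name1 01.01.2018"
        stem, _ext = os.path.splitext(os.path.basename(path))
        groups[stem[-date_len:]].append(path)
    return dict(groups)

# Hypothetical filenames, used only to show the grouping
example = ["name1 01.01.2018.xlsx", "name2 01.01.2018.xlsx", "name3 12.23.2019.xlsx"]
groups = group_paths_by_date(example)
print(groups)
```

Each list of paths can then be read with pd.read_excel and combined with pd.concat, one result per date. New dates are picked up automatically because the keys come from the filenames themselves.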