
How can I combine files with matching ending characters?


I have Excel files named like "name1 01.01.2018.xlsx", "name1 01.01.2018.xlsx", "name2 12.23.2019.xlsx", and so on. I want to join all files with matching dates (the last 10 characters before the extension).

I can join all of them by doing:

import glob
import os
import pandas as pd

os.chdir('filepath')
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
df = files[0].drop(files[0].tail(0).index).append([files[i].drop(files[i].tail(0).index) for i in range(1,len(files))])

How can I join only the files whose last characters match? I don't have a list of the possible values for the last 10 characters, so the grouping needs to update automatically.


Solution

  • First, the code can be tidied up a bit. The line that joins the DataFrames is correct, but it is hard to read and can be written more simply:

    import glob
    import os
    import pandas as pd
    
    os.chdir('filepath')
    files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("*.xlsx")]
    
    # drop the tail of all files (tail(0) selects no rows, so as written this drops nothing)
    files = [f.drop(f.tail(0).index) for f in files]
    
    # join all files (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
    df = pd.concat(files)
    

    Next, keep the relevant part of each filename alongside the DataFrame loaded from it, so that every DataFrame can be tied back to the file it came from.

    import glob
    import os
    import pandas as pd
    
    os.chdir('filepath')
    
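    # store the last 10 characters of each filename, ignoring the ".xlsx" extension
    # (p[-10:] on the full name would give ".2018.xlsx", not the date)
    files = [(os.path.splitext(p)[0][-10:], pd.read_excel(p, skipfooter=1)) for p in glob.glob("*.xlsx")]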
    
    # drop the tail of all files (a no-op as written, since tail(0) is empty)
    files = [(p, f.drop(f.tail(0).index)) for p, f in files]
    
    # group files by last 10 characters of filename
    files = {key: [frame for k, frame in files if k == key] for key in {k for k, _ in files}}
    
    # join all files with same last 10 characters
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    for key, value in files.items():
        files[key] = pd.concat(value)
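
As a variation on the steps above, here is a minimal sketch that groups the paths by date first and only reads the files afterwards, so the grouping can be checked without opening any workbook. The helper names `date_key`, `group_paths_by_date` and `combine_by_date`, and the `width` parameter, are made up for illustration. It assumes the date is always the last 10 characters of the filename before the extension.

```python
import os
from collections import defaultdict


def date_key(path, width=10):
    """Return the last `width` characters of the filename, without its extension."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem[-width:]


def group_paths_by_date(paths, width=10):
    """Map each date suffix to the list of paths that end with it."""
    groups = defaultdict(list)
    for path in paths:
        groups[date_key(path, width)].append(path)
    return dict(groups)


def combine_by_date(paths, width=10):
    """Read every file and concatenate the ones that share a date suffix."""
    # Imported here so the two grouping helpers above work without pandas installed.
    import pandas as pd

    return {
        key: pd.concat([pd.read_excel(p, skipfooter=1) for p in members])
        for key, members in group_paths_by_date(paths, width).items()
    }
```

With this in place, something like `combine_by_date(glob.glob("*.xlsx"))` should give one DataFrame per date, keyed by the date string, and new dates are picked up automatically because the keys come from the filenames themselves.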