
How can I find only the files whose name contains the date of the last day of a month, using Python?


I have a folder with hundreds of .xlsx files containing daily data that goes back several years.

The file name pattern is 'ABC DEF 20150501.xlsx'

How can I pick the files for the last day of each month across all these years?

Sadly I have no clue how to do it.

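import os
import re

import pandas as pd

path = '.'  # placeholder: folder containing the .xlsx files
# keep only files named like 'ABC DEF 20150501.xlsx'
files = [f for f in os.listdir(path) if re.match(r'ABC DEF [0-9]+\.xlsx', f)]
df = pd.DataFrame(files)
df.columns = ['file_name']
# splitting on spaces gives three parts, so the date is at index 2
df['port_date'] = df['file_name'].apply(lambda x: x.split(' ')[2].split('.')[0])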

I only managed to get the date into one column, and then ran out of ideas!

Any help will be appreciated.


Solution

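  • If you need the last row for each month and year, extract the datetimes, sort by them, then group by year and month and take GroupBy.last. This returns the latest date present in each month, which may not be the calendar month-end: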

    df = pd.DataFrame({'file_name':['ABC DEF 20150501.xlsx',
                                    'ABC DEF 20150701.xlsx',
                                    'ABC DEF 20150711.xlsx']})
    
    print (df)
                   file_name
    0  ABC DEF 20150501.xlsx
    1  ABC DEF 20150701.xlsx
    2  ABC DEF 20150711.xlsx
    
    df['port_date'] = pd.to_datetime(df['file_name'].str.extract(r'(\d+)\.', expand=False),
                                     format='%Y%m%d')
    df = df.sort_values('port_date')
    
    df = (df.groupby([df['port_date'].dt.year, df['port_date'].dt.month])
            .last()
            .reset_index(drop=True))
    print (df)
                   file_name  port_date
    0  ABC DEF 20150501.xlsx 2015-05-01
    1  ABC DEF 20150711.xlsx 2015-07-11
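
The groupby approach keeps the latest file that exists in each month, so a month with no file on its final day still contributes a row (2015-05-01 and 2015-07-11 in the output above). If only true calendar month-ends are wanted, one option is to compare each date with the number of days in its month. The following is a minimal sketch using only the standard library; the helper names (extract_date, is_month_end, month_end_files) and the sample list are made up for illustration, and the pattern assumes every file follows the 'ABC DEF YYYYMMDD.xlsx' naming from the question.

```python
import calendar
import re
from datetime import datetime

# Assumed naming scheme, taken from the question: 'ABC DEF 20150501.xlsx'
FILE_PATTERN = re.compile(r'ABC DEF (\d{8})\.xlsx')


def extract_date(file_name):
    """Return the date embedded in the file name, or None if it cannot be parsed."""
    match = FILE_PATTERN.fullmatch(file_name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), '%Y%m%d').date()
    except ValueError:
        # eight digits that do not form a valid date, e.g. '20151340'
        return None


def is_month_end(day):
    """True if `day` is the last calendar day of its month (leap years included)."""
    # calendar.monthrange returns (weekday of the 1st, number of days in the month)
    return day.day == calendar.monthrange(day.year, day.month)[1]


def month_end_files(file_names):
    """Return, sorted by name, the files whose embedded date is a month-end."""
    dated = ((extract_date(name), name) for name in file_names)
    return sorted(name for day, name in dated if day is not None and is_month_end(day))


# Hypothetical sample; in practice this would be os.listdir(path)
sample = ['ABC DEF 20150501.xlsx',
          'ABC DEF 20150531.xlsx',
          'ABC DEF 20160229.xlsx',
          'ABC DEF 20150228.xlsx',
          'ABC DEF 20150711.xlsx',
          'notes.txt']

print(month_end_files(sample))
```

With a datetime column such as port_date already in a DataFrame, the same check should be available through the Series.dt.is_month_end accessor, for example df[df['port_date'].dt.is_month_end].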