I have a folder with hundreds of .xlsx files which are daily data from years ago.
The file name pattern is 'ABC DEF 20150501.xlsx'
How can I pick files from the last day of each months for all these years.
Sadly I have no clue how to do it.
import pandas as pd
import re
files = [f for f in os.listdir(path) if re.match(r'ABC DEF [0-9]+.xlsx', f)]
df = pd.DataFrame(files)
df.columns = ['file_name']
df['port_date'] = df['file_name'].apply(lambda x : x.split(' ')[3].split('.')[0])
I got only the date in one column, and run out of idea!
Any help will be appreciated.
If need last row for each month and year extract datetimes, sorting and gruping by years and months with GroupBy.last
:
df = pd.DataFrame({'file_name':['ABC DEF 20150501.xlsx',
'ABC DEF 20150701.xlsx',
'ABC DEF 20150711.xlsx']})
print (df)
file_name
0 ABC DEF 20150501.xlsx
1 ABC DEF 20150701.xlsx
2 ABC DEF 20150711.xlsx
df['port_date'] = pd.to_datetime(df['file_name'].str.extract('(\d+)\.', expand=False))
df = df.sort_values('port_date')
df = (df.groupby([df['port_date'].dt.year, df['port_date'].dt.month])
.last()
.reset_index(drop=True))
print (df)
file_name port_date
0 ABC DEF 20150501.xlsx 2015-05-01
1 ABC DEF 20150711.xlsx 2015-07-11