I have a long timeserie where each year is archived within a folder that corresponds to this year. But, in each folder, data is not recorded in a single file, instead, in monthly files.
e.g. 1954 > APR, AUG, DEC... SEP
When I import these files using Glob and create a Dataframe with Pandas they're imported in this same order (above). But, instead, I need a correct months sequence (JAN, FEB, MAR..) for plotting and working with'em. So, my question is:
There is any way to force Glob importing files in a specific order or, maybe, rearrange files using Pandas?
path = r'path'
allFiles = glob.glob(path+"/*.dtf")
df = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_,header = None,sep=r"\s*")
list_.append(df)
df = pd.concat(list_)
Thanks.
You can use concat
with parameter keys
with filenames:
Data for testing are here.
path = r'path-dtfs'
#add /* for read subfolders
allFiles = glob.glob(path+"/*/*.dtf")
print (allFiles)
['path\\1954\\FEB.dtf', 'path\\1954\\JAN.dtf', 'path\\1955\\APR.dtf', 'path\\1955\\MAR.dtf']
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_,header = None,sep=r"\s+")
list_.append(df)
Then create new columns by split
and insert
. For correct sorting need ordered categorical with sort_values
:
df = pd.concat(list_, keys=allFiles)
.reset_index(level=1, drop=True)
.rename_axis('years').reset_index()
s = df['years'].str.split('\\')
df['years'] = s.str[-2].astype(int)
df.insert(1, 'months', s.str[-1].str.replace('.dtf', ''))
#add all missing months
cats = ['JAN','FEB','MAR','APR']
df['months'] = df['months'].astype('category', categories=cats, ordered=True)
df = df.sort_values(['years','months']).reset_index(drop=True)
print (df)
years months 0 1 2
0 1954 JAN 0 1 2
1 1954 JAN 1 5 8
2 1954 FEB 0 9 6
3 1954 FEB 1 6 4
4 1955 MAR 5 6 8
5 1955 MAR 4 7 9
6 1955 APR 0 3 6
7 1955 APR 1 4 1
Another solution is create datetime
columns by str.extract
with to_datetime
:
df = pd.concat(list_, keys=allFiles)
.reset_index(level=1, drop=True)
.rename_axis('dates')
.reset_index()
df['dates'] = df['dates'].str.extract('path\\\(.*).dtf', expand=False)
df['dates'] = pd.to_datetime(df['dates'], format='%Y\%b')
df = df.sort_values('dates').reset_index(drop=True)
print (df)
dates 0 1 2
0 1954-01-01 0 1 2
1 1954-01-01 1 5 8
2 1954-02-01 0 9 6
3 1954-02-01 1 6 4
4 1955-03-01 5 6 8
5 1955-03-01 4 7 9
6 1955-04-01 0 3 6
7 1955-04-01 1 4 1
Similar solution is use month period
by to_period
:
df = pd.concat(list_, keys=allFiles)
.reset_index(level=1, drop=True)
.rename_axis('periods').reset_index()
df['periods'] = df['periods'].str.extract('path\\\(.*).dtf', expand=False)
df['periods'] = pd.to_datetime(df['periods'], format='%Y\%b').dt.to_period('M')
df = df.sort_values('periods').reset_index(drop=True)
print (df)
periods 0 1 2
0 1954-01 0 1 2
1 1954-01 1 5 8
2 1954-02 0 9 6
3 1954-02 1 6 4
4 1955-03 5 6 8
5 1955-03 4 7 9
6 1955-04 0 3 6
7 1955-04 1 4 1