TL;DR:
How can I use the format parameter in pd.to_datetime() to parse a date that is formatted as 'YYYY mmdd dd'?
Background:
I am pulling some data from an API, and the index holds dates that are stored as strings. Here's what the index looks like:
Index(['2020 0422 22', '2020 0423 23', '2020 0424 24', '2020 0427 27',
'2020 0428 28'],
dtype='object')
Option 1:
Normally I would do something like df.index = pd.to_datetime(df.index)
and that would work like a charm. If it didn't, I would use the format parameter to parse, like this: pd.to_datetime(df.index, format='%y/%m/%d')
However, because the day is included twice, I'm unsure how to use this parameter.
Option 2:
I could also use a list comprehension to create a new list, convert that to datetime, then set the index of df equal to that. Something like this:
[ x.split(' ')[0] + x.split(' ')[1] for x in df.tail().index ]
or
[ x[0:-2] for x in df.tail().index ]
But neither of these options seems very Pythonic.
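For completeness, here is a minimal sketch of how Option 2 might look end to end. The names cleaned and parsed are only illustrative, and the index is rebuilt from two of the sample values above rather than taken from the real API data.

```python
import pandas as pd

# Two sample values copied from the index shown above.
idx = pd.Index(['2020 0422 22', '2020 0423 23'])

# Keep the year and the month-day parts, dropping the repeated day.
cleaned = [''.join(x.split(' ')[:2]) for x in idx]

# The cleaned strings look like '20200422', so the format has no spaces.
parsed = pd.to_datetime(cleaned, format='%Y%m%d')
print(parsed)
```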
The question:
How can I use the format parameter in pd.to_datetime() to parse a date that is formatted as 'YYYY mmdd dd'?
Use str.rsplit with n=1, select the first element of each resulting list, and then pass the result to to_datetime:
idx = pd.Index(['2020 0422 22', '2020 0423 23', '2020 0424 24', '2020 0427 27',
'2020 0428 28'])
df = pd.DataFrame(index = idx)
df.index = pd.to_datetime(df.index.str.rsplit(n=1).str[0], format='%Y %m%d')
print (df)
Empty DataFrame
Columns: []
Index: [2020-04-22 00:00:00, 2020-04-23 00:00:00,
2020-04-24 00:00:00, 2020-04-27 00:00:00,
2020-04-28 00:00:00]
Btw, the most intuitive answer failed:
df.index = pd.to_datetime(df.index, format='%Y %m%d %d')
print (df)
error: redefinition of group name 'd' as group 4; was group 3
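A likely explanation, assuming the parser turns each format directive into a named regular-expression group (as Python's standard strptime implementation does): using %d twice produces two groups with the same name, which the re module rejects. The pattern below is a hand-written stand-in for illustration, not the exact regex pandas builds.

```python
import re

# Hypothetical pattern mimicking a '%d %d' format: each directive becomes
# a named group, so the group name 'd' appears twice.
pattern = r'(?P<d>\d\d) (?P<d>\d\d)'

try:
    re.compile(pattern)
    message = None
except re.error as exc:
    # Compiling fails because a group name may only be defined once.
    message = str(exc)

print(message)
```

This is why the duplicated day has to be stripped from the string before parsing rather than described in the format.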