Tags: python-3.x, pandas, datetime, python-datetime, string-to-datetime

Pandas to_datetime parsing for oddly formatted date


TL;DR:

How can I use the format parameter in pd.to_datetime() to parse a date that is formatted as 'YYYY mmdd dd'?

Background:

I am pulling some data from an API, and the index holds dates that are formatted as strings. Here's what the index looks like:

Index(['2020 0422 22', '2020 0423 23', '2020 0424 24', '2020 0427 27',
       '2020 0428 28'],
      dtype='object')

Option 1:

Normally I would do something like df.index = pd.to_datetime(df.index), which usually works like a charm. If that doesn't work, I would use the format parameter, like this: pd.to_datetime(df.index, format='%Y/%m/%d'). However, because the day is included twice, I'm unsure how to use this parameter.

Option 2:

I could also use a list comprehension to create a new list, convert that to datetime, then set the index of df equal to that. Something like this:

[ x.split(' ')[0] + x.split(' ')[1] for x in df.tail().index ]

or

[ x[:-3] for x in df.tail().index ]

But neither of these options seems very Pythonic.
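
For reference, here is a minimal runnable sketch of Option 2. The three sample strings are taken from the index shown above, and the names idx, trimmed, and parsed are only illustrative. It assumes every entry has the same fixed-width 'YYYY mmdd dd' layout:

```python
import pandas as pd

# Sample values copied from the index in the question.
idx = pd.Index(['2020 0422 22', '2020 0423 23', '2020 0424 24'])

# Drop the trailing duplicate day, keeping only the 'YYYY mmdd' part.
trimmed = [x.rsplit(' ', 1)[0] for x in idx]

# Parse the trimmed strings with an explicit format.
parsed = pd.to_datetime(trimmed, format='%Y %m%d')

# Same idea without the Python-level loop: keep the first 9 characters.
# This relies on the fixed-width layout assumed above.
parsed_sliced = pd.to_datetime(idx.str[:9], format='%Y %m%d')

print(parsed)
```

Both variants work, but they still trim the string before parsing rather than handling the duplicate day through format alone.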

The question:

How can I use the format parameter in pd.to_datetime() to parse a date that is formatted as 'YYYY mmdd dd'?


Solution

  • Use str.rsplit with n=1, select the first element of each resulting list, and pass the result to to_datetime:

    import pandas as pd

    idx = pd.Index(['2020 0422 22', '2020 0423 23', '2020 0424 24', '2020 0427 27',
                    '2020 0428 28'])
    df = pd.DataFrame(index=idx)

    # Split off the trailing duplicate day, keep the 'YYYY mmdd' part, then parse it.
    df.index = pd.to_datetime(df.index.str.rsplit(n=1).str[0], format='%Y %m%d')
    print(df)
    
    Empty DataFrame
    Columns: []
    Index: [2020-04-22 00:00:00, 2020-04-23 00:00:00, 
            2020-04-24 00:00:00, 2020-04-27 00:00:00, 
            2020-04-28 00:00:00]
    

    Btw, the most intuitive approach fails, because %d appears twice in the format:

    df.index = pd.to_datetime(df.index, format='%Y %m%d %d')
    print (df)
    

    error: redefinition of group name 'd' as group 4; was group 3
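
The error message suggests that each directive becomes a named regex group, so a repeated %d defines the group 'd' twice. As far as I know, the standard library's datetime.strptime behaves the same way, so the sketch below uses it to reproduce the failure without pandas. The exact exception type may vary between Python versions, so both re.error and ValueError are caught. The helper parse_with_check is hypothetical (not part of pandas or the standard library). It shows one way to strip the redundant day while still checking that it agrees with the parsed date:

```python
import re
from datetime import datetime

raw = "2020 0422 22"

# A format with %d twice is expected to fail when the pattern is compiled.
try:
    datetime.strptime(raw, "%Y %m%d %d")
except (re.error, ValueError) as exc:
    print(type(exc).__name__, exc)


def parse_with_check(value):
    """Parse 'YYYY mmdd dd' and verify the repeated day (hypothetical helper)."""
    # Split once from the right: 'YYYY mmdd' and the trailing 'dd'.
    head, trailing_day = value.rsplit(maxsplit=1)
    parsed = datetime.strptime(head, "%Y %m%d")
    # Reject inputs where the two day fields disagree.
    if parsed.day != int(trailing_day):
        raise ValueError(f"day mismatch in {value!r}")
    return parsed


print(parse_with_check(raw))
```

The consistency check is optional. The rsplit approach above simply discards the trailing day, which is fine if the API is trusted to repeat it correctly.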