python pandas

How to convert a list of calendar dates with to_datetime in pandas


I have a function that returns a list of holidays. The list looks like this:

['30 May 2024','1 May 2024', '29 Aug 2024', '14 Aug 2024', '19 May 2024'] 

When I try to run

print(pd.to_datetime(['30 May 2024','1 May 2024', '29 Aug 2024', '14 Aug 2024', '19 May 2024']))

I get this error:

============================================================================================== RESTART: C:\Users\Bhargav\Downloads\gapi.py =============================================================================================
Traceback (most recent call last):
  File "C:\Users\Bhargav\Downloads\gapi.py", line 2, in <module>
    print(pd.to_datetime(['30 May 2024','1 May 2024', '29 Aug 2024', '14 Aug 2024', '19 May 2024']))
  File "C:\Users\Bhargav\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\tools\datetimes.py", line 1099, in to_datetime
    result = convert_listlike(argc, format)
  File "C:\Users\Bhargav\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\tools\datetimes.py", line 433, in _convert_listlike_datetimes
    return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
  File "C:\Users\Bhargav\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\tools\datetimes.py", line 467, in _array_strptime_with_fallback
    result, tz_out = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
  File "strptime.pyx", line 501, in pandas._libs.tslibs.strptime.array_strptime
  File "strptime.pyx", line 451, in pandas._libs.tslibs.strptime.array_strptime
  File "strptime.pyx", line 583, in pandas._libs.tslibs.strptime._parse_with_format
ValueError: time data "29 Aug 2024" doesn't match format "%d %B %Y", at position 2. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

But it works perfectly with this list:

print(pd.to_datetime(['30 Dec 2024','1 May 2024', '29 Aug 2024', '14 Aug 2024', '19 May 2024']))

I get

============================================================================================== RESTART: C:\Users\Bhargav\Downloads\gapi.py =============================================================================================
DatetimeIndex(['2024-12-30', '2024-05-01', '2024-08-29', '2024-08-14',
               '2024-05-19'],
          dtype='datetime64[ns]', freq=None)

Am I missing anything here?


Solution

  • The issue is caused by having May as the first date. "May" is both the full (%B) and the abbreviated (%b) name of the month.

    pandas infers the format from the first element, so it chose %d %B %Y, which matches May but not Aug (that would need %b). The inference is done by pandas.tseries.api.guess_datetime_format:

    from pandas.tseries.api import guess_datetime_format
    
    guess_datetime_format('30 May 2024') # '%d %B %Y'
    

    When Dec or Aug is the first date, the guess is unambiguous and the issue does not occur:

    guess_datetime_format('30 Dec 2024') # '%d %b %Y'
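
The same mismatch can be reproduced with the standard library alone, which may make the ambiguity easier to see. This is only a sketch: the helper name `matches` is made up for illustration, and it assumes an English locale for month names.

```python
from datetime import datetime


def matches(text: str, fmt: str) -> bool:
    """Return True if `text` parses under the strptime format `fmt`."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


full = "%d %B %Y"  # full month name, the format guessed from '30 May 2024'
abbr = "%d %b %Y"  # abbreviated month name

# 'May' satisfies both formats; 'Aug' only %b; 'August' only %B.
for text in ["30 May 2024", "29 Aug 2024", "29 August 2024"]:
    print(text, "| %B:", matches(text, full), "| %b:", matches(text, abbr))
```

Only "May" passes under both formats, so a guess based on a May date cannot tell which of the two the rest of the list uses.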
    

    If you use format='mixed' or format='%d %b %Y', this will work fine:

    pd.to_datetime(['30 May 2024','1 May 2024', '29 Aug 2024', '14 Aug 2024', '19 May 2024'], format='mixed')
    
    DatetimeIndex(['2024-05-30', '2024-05-01', '2024-08-29', '2024-08-14',
                   '2024-05-19'],
                  dtype='datetime64[ns]', freq=None)
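
For completeness, here is a sketch of the explicit-format variant mentioned above. The variable names and the extra '1 June 2024' entry are invented for illustration; `errors='coerce'` is an optional addition that turns non-matching strings into NaT instead of raising, which can help spot entries that do not follow the expected format.

```python
import pandas as pd

holidays = ['30 May 2024', '1 May 2024', '29 Aug 2024', '14 Aug 2024', '19 May 2024']

# %b is the abbreviated month name; it also matches 'May',
# so every element parses with one explicit format.
parsed = pd.to_datetime(holidays, format='%d %b %Y')
print(parsed)

# Hypothetical extra entry with a full month name: under errors='coerce'
# it becomes NaT rather than raising a ValueError.
checked = pd.to_datetime(holidays + ['1 June 2024'], format='%d %b %Y', errors='coerce')
print(checked.isna().sum())  # number of entries that did not match the format
```

An explicit format is stricter than format='mixed': it fails (or yields NaT) on anything unexpected, whereas 'mixed' infers a format per element.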