I am currently working on a project where I need to validate that dates in a pandas DataFrame follow a specific format, such as day[any divider]month[any divider]year. I decided to use pandas.to_datetime() since it solved my problem in a pretty clean way. Here's an example of what I've done:
foo = pd.Series(['01-01-2023','JAN-01-2023'])
pd.to_datetime(foo,dayfirst=True,errors='coerce').notnull()
>> 0 True
1 False
However, once I change the order of this series and use 'JAN-01-2023' as the first element, here's what happens:
foo = pd.Series(['JAN-01-2023','01-01-2023'])
pd.to_datetime(foo,dayfirst=True,errors='coerce').notnull()
>> 0 True
1 True
My current workaround is to avoid to_datetime and instead use .apply with a custom validation function that loops over the DataFrame's column, but I've noticed performance issues with .apply since it iterates over every row of the series.
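For reference, here is a minimal sketch of that kind of .apply-based check; the helper name, the regex, and the set of accepted dividers ('-', '/', '.') are my assumptions, not the original code:

import re
import pandas as pd

# Hypothetical element-wise validator: checks the shape day<divider>month<divider>year,
# where the divider is assumed to be '-', '/' or '.' and must be the same on both sides.
# Note: this only checks the shape, not whether day/month are real calendar values.
_DMY = re.compile(r'^\d{1,2}([-/.])\d{1,2}\1\d{4}$')

def is_dmy(value) -> bool:
    return bool(_DMY.match(str(value)))

foo = pd.Series(['01-01-2023', 'JAN-01-2023'])
foo.apply(is_dmy)  # element-wise loop, hence the performance concern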
I'm not sure whether this is a misinterpretation of the to_datetime documentation on my part or really a bug. I could not find any related issue on the pandas GitHub repository, nor here on Stack Overflow. Has anybody ever run into this?
System Settings:
Windows 10 x64;
Python version 3.8.10;
pandas version 2.0.3
Checked issues on Stack Overflow.
Checked issues on the pandas repository on GitHub.
Checked the to_datetime documentation.
pandas.to_datetime() uses the first string in the series to infer the format, and then tries to parse every element with that inferred format.
# Format %d-%m-%Y inferred from '01-01-2023' (not valid for 'JAN-01-2023')
foo = pd.Series(['01-01-2023', 'JAN-01-2023'])
pd.to_datetime(foo, dayfirst=True, errors='coerce').notnull()
# Format inferred from 'JAN-01-2023'; here both elements end up parsing
foo = pd.Series(['JAN-01-2023', '01-01-2023'])
pd.to_datetime(foo, dayfirst=True, errors='coerce').notnull()
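If you want to see which format pandas would infer from a given first element, there is a helper, guess_datetime_format; it is exposed as pandas.tseries.api.guess_datetime_format in recent pandas versions (in older versions it lives in a private module), so treat the import path as version-dependent:

from pandas.tseries.api import guess_datetime_format

# Format pandas would infer if this string were the first element
guess_datetime_format('01-01-2023', dayfirst=True)   # '%d-%m-%Y'
guess_datetime_format('JAN-01-2023', dayfirst=True)  # may be None or a '%b'-based format, depending on the version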
If the series can contain multiple formats, you probably want to pass format='mixed', which infers the format for each element individually:
pd.to_datetime(foo, format='mixed', dayfirst=True, errors='coerce')
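Note that format='mixed' will also happily accept 'JAN-01-2023'. If the goal is strictly to validate the day[divider]month[divider]year shape, one option is to try an explicit format per accepted divider instead; a sketch, where the list of dividers is my assumption:

import pandas as pd

foo = pd.Series(['01-01-2023', 'JAN-01-2023', '01/01/2023'])

# One explicit day-month-year format per assumed divider
formats = ['%d-%m-%Y', '%d/%m/%Y', '%d.%m.%Y']
valid = pd.Series(False, index=foo.index)
for fmt in formats:
    valid |= pd.to_datetime(foo, format=fmt, errors='coerce').notnull()

valid  # True only where the value matches one of the explicit formats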
In cases where the date is ambiguous, like '01-01-2023', passing dayfirst=True gives priority to formats where the day comes first (there is also a yearfirst parameter).
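For example, with an ambiguous string like '02-03-2023', the flag decides which number is treated as the day:

import pandas as pd

pd.to_datetime('02-03-2023', dayfirst=True)   # 2023-03-02 (2 March)
pd.to_datetime('02-03-2023', dayfirst=False)  # 2023-02-03 (February 3rd)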