I have a dataset with mixed datetime formats and strings in the date columns. I am trying to standardize the date in the columns to a regular datetime format.
I tried combining these solutions (Clean a Messy Date Column with Mixed Formats in Pandas),(convert pandas datetime column yyyy-mm-dd to YYYYMMDD), (Working with mixed datetime formats in pandas).
I tried converting the string that has no separators first and then running the conversion function, but the I'm getting the wrong year (for example 20110912 is converted to 1970/01/01). There are so many inconsistencies, so I'm not sure which method would work best. Any suggestions would be greatly appreciated!
import numpy as np
import pandas as pd
df = pd.DataFrame({'date':[20110912.0, 20230102, '10/10/17', '4/8/14',
'7/28/2020', '20121001', 2023.01.02',
'2019-04-23 0:00:00', '2011-12-21 0:00:00',
'07/28/14', '', 'NaN' ]})
s = df['date'].copy()
mask = s.apply(type) == float
s.loc[mask] = s.loc[mask].astype(int)
newdf = df.assign(date=pd.to_datetime(s.astype(str),errors='coerce').dt.date)
My code keeps throwing an error at the line to convert integers. ValueError: cannot convert float NaN to integer
Edit: After some forth and back, it seems like there really is quite a bit of noise in that data. To list a few kinds of noise that surfaced up so far:
'%Y-%m-%d %h:%M:%S'
, '%m/%D/%y'
, etc.);int
values such as 20230320
;float
values such as 20230320.0
;'NaN', '', ...
);NaN
as float
).In this modified answer, we will:
int
;str
;pd.to_datetime(..., errors='coerce')
;utc=True
to bring all datetimes on a common timezone (UTC), then .dt.localize(None)
to make them all tz-naive;NaN
in the initial data, but NaT
in the result, to review any opportunity to get more valid data through this cleaning process.I should also mention some unexpected behavior (pandas
versions tried: from 1.4.2 to 1.5.3) of pd.to_datetime
in the result type as noted in this bug report (I did not expected to see float('NaN')
and str('NaT')
in the results when arg errors='coerce'
is passed).
We expand a bit on the OP's example to include some other formats and data issues:
df = pd.DataFrame({'date':[
20110912.0, 20230102, '10/10/17', '4/8/14', '7/28/2020', '20121001',
'2023.01.02', '2019-04-23 0:00:00', '2011-12-21 0:00:00', '07/28/14',
'Mon Mar 20 11:03:10 UTC 2023', '', 'NaN', 'foo', float('NaN'),
]})
from datetime import datetime
s = df['date'].copy()
mask = (s.apply(type) == float) & ~s.isna()
s.loc[mask] = s.loc[mask].astype(int)
s = s.astype(str)
s2 = pd.to_datetime(s, errors='coerce')
# see github.com/pandas-dev/pandas/issues/52094
is_bad = s2.apply(type) != datetime
s2.loc[is_bad] = pd.NaT
has_tz = (
s2[~is_bad]
.apply(datetime.tzname).astype(bool)
.reindex(s.index, fill_value=False)
)
if has_tz.any():
# we convert a second time, to get all datetimes
# to a common tz: utc, then make all tz-naive
s3 = pd.to_datetime(
s, errors='coerce', utc=True).dt.tz_localize(None)
else:
s3 = s2
# keep only dates; an alternatively would be to
# use .dt.normalize() to truncate down to mindnight
newdf = df.assign(date=s3.dt.date)
With this:
>>> newdf
date
0 2011-09-12
1 2023-01-02
2 2017-10-10
3 2014-04-08
4 2020-07-28
5 2012-10-01
6 2023-01-02
7 2019-04-23
8 2011-12-21
9 2014-07-28
10 2023-03-20
11 NaT
12 NaT
13 NaT
14 NaT
And, for inspection:
>>> df.loc[is_bad, 'date'].apply(repr).value_counts()
'' 1
'NaN' 1
'foo' 1
nan 1
Also, if you are curious about the intermediary values in the cleaning (and to inspect what happens to tz-aware datetimes):
>>> pd.concat([s, is_bad, s2, has_tz, s3], axis=1, keys='s is_bad s2 has_tz s3'.split())
s is_bad s2 has_tz s3
0 20110912 False 2011-09-12 00:00:00 False 2011-09-12 00:00:00
1 20230102 False 2023-01-02 00:00:00 False 2023-01-02 00:00:00
2 10/10/17 False 2017-10-10 00:00:00 False 2017-10-10 00:00:00
3 4/8/14 False 2014-04-08 00:00:00 False 2014-04-08 00:00:00
4 7/28/2020 False 2020-07-28 00:00:00 False 2020-07-28 00:00:00
5 20121001 False 2012-10-01 00:00:00 False 2012-10-01 00:00:00
6 2023.01.02 False 2023-01-02 00:00:00 False 2023-01-02 00:00:00
7 2019-04-23 0:00:00 False 2019-04-23 00:00:00 False 2019-04-23 00:00:00
8 2011-12-21 0:00:00 False 2011-12-21 00:00:00 False 2011-12-21 00:00:00
9 07/28/14 False 2014-07-28 00:00:00 False 2014-07-28 00:00:00
10 Mon Mar 20 11:03:10 UTC 2023 False 2023-03-20 11:03:10+00:00 True 2023-03-20 11:03:10
11 True NaT False NaT
12 NaN True NaT False NaT
13 foo True NaT False NaT
14 nan True NaT False NaT