Search code examples
pythonpandasnumpydatetimedata-cleaning

Standardize mixed datetime format in pandas dataframe that includes strings


I have a dataset with mixed datetime formats and strings in the date columns. I am trying to standardize the date in the columns to a regular datetime format.

I tried combining these solutions (Clean a Messy Date Column with Mixed Formats in Pandas),(convert pandas datetime column yyyy-mm-dd to YYYYMMDD), (Working with mixed datetime formats in pandas).

I tried converting the string that has no separators first and then running the conversion function, but the I'm getting the wrong year (for example 20110912 is converted to 1970/01/01). There are so many inconsistencies, so I'm not sure which method would work best. Any suggestions would be greatly appreciated!

import numpy as np
import pandas as pd


df = pd.DataFrame({'date':[20110912.0, 20230102, '10/10/17', '4/8/14',
                           '7/28/2020', '20121001', 2023.01.02',
                           '2019-04-23 0:00:00', '2011-12-21 0:00:00', 
                           '07/28/14', '', 'NaN' ]})


s = df['date'].copy()
mask = s.apply(type) == float
s.loc[mask] = s.loc[mask].astype(int)

newdf = df.assign(date=pd.to_datetime(s.astype(str),errors='coerce').dt.date)

My code keeps throwing an error at the line to convert integers. ValueError: cannot convert float NaN to integer


Solution

  • Edit: After some forth and back, it seems like there really is quite a bit of noise in that data. To list a few kinds of noise that surfaced up so far:

    • mixed (but valid) date formats (e.g. '%Y-%m-%d %h:%M:%S', '%m/%D/%y', etc.);
    • dates with indication of time zone, others without;
    • int values such as 20230320;
    • float values such as 20230320.0;
    • random strings (e.g. 'NaN', '', ...);
    • missing values (NaN as float).

    In this modified answer, we will:

    1. convert non-NaN floats to int;
    2. convert the whole column to type str;
    3. use pd.to_datetime(..., errors='coerce');
    4. see if any datetime is tz-aware; if so, use utc=True to bring all datetimes on a common timezone (UTC), then .dt.localize(None) to make them all tz-naive;
    5. inspect the values that are not NaN in the initial data, but NaT in the result, to review any opportunity to get more valid data through this cleaning process.

    I should also mention some unexpected behavior (pandas versions tried: from 1.4.2 to 1.5.3) of pd.to_datetime in the result type as noted in this bug report (I did not expected to see float('NaN') and str('NaT') in the results when arg errors='coerce' is passed).

    Reproducible example

    We expand a bit on the OP's example to include some other formats and data issues:

    df = pd.DataFrame({'date':[
        20110912.0, 20230102, '10/10/17', '4/8/14', '7/28/2020', '20121001',
        '2023.01.02', '2019-04-23 0:00:00', '2011-12-21 0:00:00', '07/28/14',
        'Mon Mar 20 11:03:10 UTC 2023', '', 'NaN', 'foo', float('NaN'),
    ]})
    

    Cleaning process

    from datetime import datetime
    
    s = df['date'].copy()
    mask = (s.apply(type) == float) & ~s.isna()
    s.loc[mask] = s.loc[mask].astype(int)
    s = s.astype(str)
    s2 = pd.to_datetime(s, errors='coerce')
    
    # see github.com/pandas-dev/pandas/issues/52094
    is_bad = s2.apply(type) != datetime
    s2.loc[is_bad] = pd.NaT
    
    has_tz = (
        s2[~is_bad]
        .apply(datetime.tzname).astype(bool)
        .reindex(s.index, fill_value=False)
    )
    if has_tz.any():
        # we convert a second time, to get all datetimes
        # to a common tz: utc, then make all tz-naive
        s3 = pd.to_datetime(
            s, errors='coerce', utc=True).dt.tz_localize(None)
    else:
        s3 = s2
    
    # keep only dates; an alternatively would be to
    # use .dt.normalize() to truncate down to mindnight
    newdf = df.assign(date=s3.dt.date)
    

    With this:

    >>> newdf
             date
    0  2011-09-12
    1  2023-01-02
    2  2017-10-10
    3  2014-04-08
    4  2020-07-28
    5  2012-10-01
    6  2023-01-02
    7  2019-04-23
    8  2011-12-21
    9  2014-07-28
    10 2023-03-20
    11        NaT
    12        NaT
    13        NaT
    14        NaT
    

    And, for inspection:

    >>> df.loc[is_bad, 'date'].apply(repr).value_counts()
    ''       1
    'NaN'    1
    'foo'    1
    nan      1
    

    Also, if you are curious about the intermediary values in the cleaning (and to inspect what happens to tz-aware datetimes):

    >>> pd.concat([s, is_bad, s2, has_tz, s3], axis=1, keys='s is_bad s2 has_tz s3'.split())
                                   s  is_bad                         s2  has_tz                  s3
    0                       20110912   False        2011-09-12 00:00:00   False 2011-09-12 00:00:00
    1                       20230102   False        2023-01-02 00:00:00   False 2023-01-02 00:00:00
    2                       10/10/17   False        2017-10-10 00:00:00   False 2017-10-10 00:00:00
    3                         4/8/14   False        2014-04-08 00:00:00   False 2014-04-08 00:00:00
    4                      7/28/2020   False        2020-07-28 00:00:00   False 2020-07-28 00:00:00
    5                       20121001   False        2012-10-01 00:00:00   False 2012-10-01 00:00:00
    6                     2023.01.02   False        2023-01-02 00:00:00   False 2023-01-02 00:00:00
    7             2019-04-23 0:00:00   False        2019-04-23 00:00:00   False 2019-04-23 00:00:00
    8             2011-12-21 0:00:00   False        2011-12-21 00:00:00   False 2011-12-21 00:00:00
    9                       07/28/14   False        2014-07-28 00:00:00   False 2014-07-28 00:00:00
    10  Mon Mar 20 11:03:10 UTC 2023   False  2023-03-20 11:03:10+00:00    True 2023-03-20 11:03:10
    11                                  True                        NaT   False                 NaT
    12                           NaN    True                        NaT   False                 NaT
    13                           foo    True                        NaT   False                 NaT
    14                           nan    True                        NaT   False                 NaT