Search code examples
python-3.xpandasdatedatetimeiso

How to parse correctly to date in python from specific ISO format


I connected to a table on a db where there are two columns with dates. I had no problem to parse the column with values formatted as this: 2017-11-03
But I don't find a way to parse the other column with dates formatted as this: 2017-10-03 05:06:52.840 +02:00

My attempts
If I parse a single value through the strptime method

dt.datetime.strptime("2017-12-14 22:16:24.037 +02:00", "%Y-%m-%d %H:%M:%S.%f %z")

I get the correct output
datetime.datetime(2017, 12, 14, 22, 16, 24, 37000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))

but if I try to use the same code format while parsing the table to the dataframe, the column dtype is an object:

Licenze_FromGY = pd.read_sql(query, cnxn, parse_dates={"EndDate":"%Y-%m-%d", "LastUpd":"%Y-%m-%d %H:%M:%S.%f %z"})


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Tenant     1000 non-null   int64         
 1   IdService  1000 non-null   object        
 2   Code       1000 non-null   object        
 3   Aggregate  1000 non-null   object        
 4   Bundle     991 non-null    object        
 5   Status     1000 non-null   object        
 6   Value      1000 non-null   int64         
 7   EndDate    258 non-null    datetime64[ns]
 8   Trial      1000 non-null   bool          
 9   LastUpd    1000 non-null   object 

I also tried to change the code format either in the read_sql method or in the pd.to_datetime() method, but then all the values become NaT:

Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], format="%Y-%m-%d %H:%M:%S.%fZ", errors="coerce") 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Tenant     1000 non-null   int64         
 1   IdService  1000 non-null   object        
 2   Code       1000 non-null   object        
 3   Aggregate  1000 non-null   object        
 4   Bundle     991 non-null    object        
 5   Status     1000 non-null   object        
 6   Value      1000 non-null   int64         
 7   EndDate    258 non-null    datetime64[ns]
 8   Trial      1000 non-null   bool          
 9   LastUpd    0 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](2), int64(2), object(5)
memory usage: 71.4+ KB
None 

Anyone can help?


Solution

  • pandas cannot handle mixed UTC offsets in one Series (column). Assuming you have data like this

    import pandas as pd
    
    df = pd.DataFrame({"datetime": ["2017-12-14 22:16:24.037 +02:00", 
                                    "2018-08-14 22:16:24.037 +03:00"]})
    

    if you just parse to datetime,

    df["datetime"] = pd.to_datetime(df["datetime"])
    
    df["datetime"]
    0    2017-12-14 22:16:24.037000+02:00
    1    2018-08-14 22:16:24.037000+03:00
    Name: datetime, dtype: object
    

    you get dtype object. The elements of the series are of the Python datetime.datetime dtype. That limits the datetime functionality, compared to the pandas datetime dtype.

    You can get that e.g. by parsing to UTC:

    df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
    
    df["datetime"]
    0   2017-12-14 20:16:24.037000+00:00
    1   2018-08-14 19:16:24.037000+00:00
    Name: datetime, dtype: datetime64[ns, UTC]
    

    You might set an appropriate time zone to re-create the UTC offset:

    df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_convert("Europe/Athens")
    
    df["datetime"]
    0   2017-12-14 22:16:24.037000+02:00
    1   2018-08-14 22:16:24.037000+03:00
    Name: datetime, dtype: datetime64[ns, Europe/Athens]