Tags: python, pandas, datetime, pytz, datetimeoffset

TimeSeries conversion from CET / CEST to UTC


I have two time series files which are meant to be in CET / CEST. The bad one of them does not write its values properly. For the good csv, see here:

#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...

...everything works fine by using:

        df['utc_time'] = (pd.to_datetime(df[local_time_column])
                            .dt.tz_localize('CET', ambiguous="infer")
                            .dt.tz_convert('UTC')
                            .dt.strftime('%Y-%m-%d %H:%M:%S'))

When converting test_bad.csv to UTC I get an AmbiguousTimeError, as two hours are missing in October (the 2 am entry is not repeated and 3 am is absent entirely); a minimal reproduction follows the sample below.

# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017   # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224   # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...
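
For reference, a minimal snippet that reproduces the error on the bad file (same conversion as above; the exact exception message may vary by pandas version):

    import pandas as pd

    df = pd.read_csv('test_bad.csv')

    # raises pytz.exceptions.AmbiguousTimeError: the single 2017-10-29 02:00
    # cannot be disambiguated because the hour is not repeated in the data
    df['utc_time'] = (pd.to_datetime(df['local_time'])
                        .dt.tz_localize('CET', ambiguous='infer')
                        .dt.tz_convert('UTC'))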

Does anyone know an elegant way to still convert the time series file to UTC and add NaN values for the missing dates in the new index? Thanks for your help.


Solution

  • Elaborating a bit on Mark Ransom's comment:

    2017-10-29 02:00,7224 
    

    is ambiguous; it could be 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That's why pd.to_datetime refuses to infer anything.
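
    You can make both readings explicit with plain pytz (a quick check, using the same Europe/Berlin zone as in the solution below):

    import pytz
    from datetime import datetime

    berlin = pytz.timezone('Europe/Berlin')
    wall = datetime(2017, 10, 29, 2, 0)

    # the same wall time maps to two different UTC instants:
    print(berlin.localize(wall, is_dst=True).astimezone(pytz.utc))   # 2017-10-29 00:00:00+00:00 (CEST)
    print(berlin.localize(wall, is_dst=False).astimezone(pytz.utc))  # 2017-10-29 01:00:00+00:00 (CET)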

    With the help of some native Python, you can work around this. Assuming you just loaded the csv into a df without parsing anything to datetime, you can continue like this:

    from datetime import datetime
    import pytz
    
    df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]
    
    # so you can make a UTC index:
    df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
    
    # Now you can create a new, hourly index from that and re-index:
    dti = pd.date_range(df.index[0], df.index[-1], freq='H')
    df2 = df.reindex(dti)
    
    # for comparison, the "re-created" local_time column:
    df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values
    

    that should give you something like

    df2
                                value           local_time
    2017-03-25 23:00:00+00:00  2016.0  2017-03-26 00:00:00
    2017-03-26 00:00:00+00:00  2017.0  2017-03-26 01:00:00
    2017-03-26 01:00:00+00:00  2018.0  2017-03-26 03:00:00
    2017-03-26 02:00:00+00:00  2019.0  2017-03-26 04:00:00
    2017-03-26 03:00:00+00:00     NaN  2017-03-26 05:00:00
                              ...                  ...
    2017-10-29 00:00:00+00:00     NaN  2017-10-29 02:00:00
    2017-10-29 01:00:00+00:00  7224.0  2017-10-29 02:00:00 # note: value randomly attributed to "second" 2 am
    2017-10-29 02:00:00+00:00     NaN  2017-10-29 03:00:00
    2017-10-29 03:00:00+00:00  7226.0  2017-10-29 04:00:00
    2017-10-29 04:00:00+00:00  7227.0  2017-10-29 05:00:00
    

    As commented above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC, but it could just as well be attributed to 2017-10-29 00:00:00 UTC. If you don't care, you're fine. If that's a problem, the best you can do in my opinion is discard the value. You can do so by using

    df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
    

    instead of the native Python part in the code above.
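
    In full, the pandas-only variant might look like this (a sketch; dropping the NaT row before building the UTC index is one way to discard the ambiguous value, assuming that is what you want):

    import pandas as pd

    df = pd.read_csv('test_bad.csv')

    # ambiguous='NaT' marks the un-inferable 2 am as NaT instead of raising
    df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
    df = df.dropna(subset=['local_time'])  # discard the ambiguous value

    # as before: UTC index, hourly re-index with NaN for the gaps
    df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
    dti = pd.date_range(df.index[0], df.index[-1], freq='H')
    df2 = df.reindex(dti)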