Tags: python, pandas, datetime, pytz, datetimeoffset

TimeSeries conversion from CET / CEST to UTC


I have two time series files which are meant to be in CET / CEST. The bad one of them does not write its values properly. For the good csv, see here:

#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...

...everything works fine by using:

        df['utc_time'] = (pd.to_datetime(df[local_time_column])
                            .dt.tz_localize('CET', ambiguous="infer")
                            .dt.tz_convert('UTC')
                            .dt.strftime('%Y-%m-%d %H:%M:%S'))

When converting test_bad.csv to UTC I get an AmbiguousTimeError, as two hours are missing in October (the 2 am entry is not repeated and 3 am is absent entirely); a minimal reproduction follows the sample below.

# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017   # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224   # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...
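
For reference, a minimal snippet that reproduces the error on the bad file (same conversion as above; the exact exception message may vary by pandas version):

    import pandas as pd

    df = pd.read_csv('test_bad.csv')

    # raises pytz.exceptions.AmbiguousTimeError: the single 2017-10-29 02:00
    # cannot be disambiguated because the hour is not repeated in the data
    df['utc_time'] = (pd.to_datetime(df['local_time'])
                        .dt.tz_localize('CET', ambiguous='infer')
                        .dt.tz_convert('UTC'))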

Does anyone know an elegant way to still convert the time series file to UTC and add NaN values for the missing dates in the new index? Thanks for your help.


Solution

  • Elaborating a bit on Mark Ransom's comment:

    2017-10-29 02:00,7224 
    

    is ambiguous; it could be 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That's why pd.to_datetime refuses to infer anything.
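
    You can make both readings explicit with plain pytz (a quick check, using the same Europe/Berlin zone as in the solution below):

    import pytz
    from datetime import datetime

    berlin = pytz.timezone('Europe/Berlin')
    wall = datetime(2017, 10, 29, 2, 0)

    # the same wall time maps to two different UTC instants:
    print(berlin.localize(wall, is_dst=True).astimezone(pytz.utc))   # 2017-10-29 00:00:00+00:00 (CEST)
    print(berlin.localize(wall, is_dst=False).astimezone(pytz.utc))  # 2017-10-29 01:00:00+00:00 (CET)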

    With the help of some native Python, you can work around this. Assuming you just loaded the csv into a df without parsing anything to datetime, you can continue like this:

    from datetime import datetime
    import pytz
    
    df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]
    
    # so you can make a UTC index:
    df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
    
    # Now you can create a new, hourly index from that and re-index:
    dti = pd.date_range(df.index[0], df.index[-1], freq='H')
    df2 = df.reindex(dti)
    
    # for comparison, the "re-created" local_time column:
    df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values
    

    that should give you something like

    df2
                                value           local_time
    2017-03-25 23:00:00+00:00  2016.0  2017-03-26 00:00:00
    2017-03-26 00:00:00+00:00  2017.0  2017-03-26 01:00:00
    2017-03-26 01:00:00+00:00  2018.0  2017-03-26 03:00:00
    2017-03-26 02:00:00+00:00  2019.0  2017-03-26 04:00:00
    2017-03-26 03:00:00+00:00     NaN  2017-03-26 05:00:00
                              ...                  ...
    2017-10-29 00:00:00+00:00     NaN  2017-10-29 02:00:00
    2017-10-29 01:00:00+00:00  7224.0  2017-10-29 02:00:00 # note: value randomly attributed to "second" 2 am
    2017-10-29 02:00:00+00:00     NaN  2017-10-29 03:00:00
    2017-10-29 03:00:00+00:00  7226.0  2017-10-29 04:00:00
    2017-10-29 04:00:00+00:00  7227.0  2017-10-29 05:00:00
    

    As commented above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC, but it could just as well be attributed to 2017-10-29 00:00:00 UTC. If you don't care, you're fine. If that's a problem, the best you can do in my opinion is discard the value. You can do so by using

    df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
    

    instead of the native Python part in the code above.
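
    In full, the pandas-only variant might look like this (a sketch; dropping the NaT row before building the UTC index is one way to discard the ambiguous value, assuming that is what you want):

    import pandas as pd

    df = pd.read_csv('test_bad.csv')

    # ambiguous='NaT' marks the un-inferable 2 am as NaT instead of raising
    df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
    df = df.dropna(subset=['local_time'])  # discard the ambiguous value

    # as before: UTC index, hourly re-index with NaN for the gaps
    df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
    dti = pd.date_range(df.index[0], df.index[-1], freq='H')
    df2 = df.reindex(dti)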