I have two time series files which are meant to be in CET / CEST. One of them is bad: it does not write its values in a proper way. For the good csv, see here:
#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...
...everything works fine by using:
df['utc_time'] = (pd.to_datetime(df[local_time_column])
                  .dt.tz_localize('CET', ambiguous="infer")
                  .dt.tz_convert('UTC')
                  .dt.strftime('%Y-%m-%d %H:%M:%S'))
When converting test_bad.csv to UTC I get an AmbiguousTimeError, because two hours around the October DST switch are missing: the repeated 2 am appears only once and 3 am is absent entirely.
# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017 # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224 # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...
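For illustration, here is a minimal sketch (with made-up rows mimicking test_bad.csv) that reproduces the error; since the repeated 2 am appears only once, ambiguous="infer" has no repeated timestamps to work with:

```python
import pandas as pd

# Made-up rows mimicking test_bad.csv: the repeated 2 am appears only once.
s = pd.to_datetime(pd.Series(["2017-10-29 01:00",
                              "2017-10-29 02:00",
                              "2017-10-29 04:00"]))
err = None
try:
    s.dt.tz_localize("CET", ambiguous="infer")
except Exception as exc:  # pytz.exceptions.AmbiguousTimeError
    err = exc
print(type(err).__name__)  # AmbiguousTimeError
```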
Does anyone know an elegant way to still convert the time series file to UTC and fill the missing dates in the new index with NaN? Thanks for your help.
Elaborating a bit on Mark Ransom's comment:
2017-10-29 02:00,7224
is ambiguous; it could be 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That's why tz_localize with ambiguous="infer" refuses to infer anything.
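To make the two readings concrete, here is a small sketch using pytz's is_dst flag to force each interpretation of the ambiguous wall-clock time:

```python
from datetime import datetime
import pytz

tz = pytz.timezone("Europe/Berlin")
wall = datetime(2017, 10, 29, 2, 0)  # the ambiguous wall-clock time

# is_dst=True  -> still CEST (UTC+2), i.e. the "first" 2 am
# is_dst=False -> already CET (UTC+1), i.e. the "second" 2 am
first = tz.localize(wall, is_dst=True).astimezone(pytz.UTC)
second = tz.localize(wall, is_dst=False).astimezone(pytz.UTC)
print(first)   # 2017-10-29 00:00:00+00:00
print(second)  # 2017-10-29 01:00:00+00:00
```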
With the help of some native Python, you can work around this. Assuming you just loaded the csv into a df without parsing anything to datetime, you can continue like this:
from datetime import datetime
import pandas as pd  # needed below for pd.date_range
import pytz
df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]
# so you can make a UTC index:
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
# Now you can create a new, hourly index from that and re-index:
dti = pd.date_range(df.index[0], df.index[-1], freq='H')
df2 = df.reindex(dti)
# for comparison, the "re-created" local_time column:
df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values
that should give you something like
df2
value local_time
2017-03-25 23:00:00+00:00 2016.0 2017-03-26 00:00:00
2017-03-26 00:00:00+00:00 2017.0 2017-03-26 01:00:00
2017-03-26 01:00:00+00:00 2018.0 2017-03-26 03:00:00
2017-03-26 02:00:00+00:00 2019.0 2017-03-26 04:00:00
2017-03-26 03:00:00+00:00 NaN 2017-03-26 05:00:00
... ...
2017-10-29 00:00:00+00:00 NaN 2017-10-29 02:00:00
2017-10-29 01:00:00+00:00 7224.0 2017-10-29 02:00:00 # note: value attributed to the "second" 2 am (pytz's is_dst=False default)
2017-10-29 02:00:00+00:00 NaN 2017-10-29 03:00:00
2017-10-29 03:00:00+00:00 7226.0 2017-10-29 04:00:00
2017-10-29 04:00:00+00:00 7227.0 2017-10-29 05:00:00
As commented above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC, but it could just as well belong to 2017-10-29 00:00:00 UTC. If you don't care, you're fine. If that's a problem, the best you can do in my opinion is to discard the value. You can do so by using
df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
instead of the native Python part in the code above.
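Putting that pandas-only variant together, a self-contained sketch (sample rows again mimic test_bad.csv; column names assumed from the question). The ambiguous 2 am becomes NaT, so the questionable value 7224 is discarded, and re-indexing on a complete hourly UTC grid fills the gaps with NaN:

```python
import pandas as pd

df = pd.DataFrame({
    "local_time": ["2017-10-29 01:00", "2017-10-29 02:00",
                   "2017-10-29 04:00", "2017-10-29 05:00"],
    "value": [7223, 7224, 7226, 7227],
})

# Ambiguous wall-clock times become NaT, so 7224 drops out of the result.
local = pd.to_datetime(df["local_time"]).dt.tz_localize("Europe/Berlin",
                                                        ambiguous="NaT")
df = df.set_index(local.dt.tz_convert("UTC"))[["value"]]

# Re-index on a complete hourly UTC grid; missing hours become NaN.
dti = pd.date_range(df.index.min(), df.index.max(), freq="h")
df2 = df.reindex(dti)
```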