I am trying to join two pandas dataframes with the pd.DataFrame.join function, but there is a problem when I try to change the datetime of an existing index. I used:
import pandas as pd
import pytz
import numpy as np
# Creating date ranges
sim_date = pd.date_range('1980-01-01', '1980-12-31', freq='1H', tz='UTC')
obs_date = pd.date_range('1980-01-01', '1980-12-31', freq='1H')
# Creating DataFrames with the time ranges as indices
sim = pd.DataFrame(np.random.rand(len(interim_date)), index=interim_date, columns=['Sim'])
obs = pd.DataFrame(np.random.rand(len(obs_date)), index=obs_date, columns=['Obs'])
# Changing the timezone of the observed data index
obs.index = obs.index.tz_localize('UTC').tz_convert('America/Phoenix')
# Printing the result of the join
print(pd.DataFrame.join(sim, obs).dropna())
Which yields:
Sim Obs
1980-01-01 00:00:00+00:00 0.844345 0.117649
1980-01-01 01:00:00+00:00 0.505349 0.755907
1980-01-01 02:00:00+00:00 0.799555 0.169102
1980-01-01 03:00:00+00:00 0.194750 0.704400
1980-01-01 04:00:00+00:00 0.459079 0.241803
1980-01-01 05:00:00+00:00 0.496936 0.726264
1980-01-01 06:00:00+00:00 0.515039 0.989569
1980-01-01 07:00:00+00:00 0.271105 0.488859
1980-01-01 08:00:00+00:00 0.545269 0.434904
1980-01-01 09:00:00+00:00 0.817365 0.067979
1980-01-01 10:00:00+00:00 0.051024 0.068993
1980-01-01 11:00:00+00:00 0.170346 0.510406
1980-01-01 12:00:00+00:00 0.518609 0.583602
1980-01-01 13:00:00+00:00 0.725753 0.402805
1980-01-01 14:00:00+00:00 0.134059 0.879183
1980-01-01 15:00:00+00:00 0.304070 0.773884
1980-01-01 16:00:00+00:00 0.742448 0.158367
1980-01-01 17:00:00+00:00 0.539499 0.067725
1980-01-01 18:00:00+00:00 0.349432 0.027337
1980-01-01 19:00:00+00:00 0.549015 0.078190
1980-01-01 20:00:00+00:00 0.089871 0.878931
1980-01-01 21:00:00+00:00 0.100849 0.359007
1980-01-01 22:00:00+00:00 0.290280 0.168759
1980-01-01 23:00:00+00:00 0.074420 0.881724
1980-01-02 00:00:00+00:00 0.091413 0.820616
but when I use this:
import pandas as pd
import pytz
import numpy as np
interim_date = pd.date_range('1980-01-01', '1980-01-02', freq='1H', tz='UTC')
obs_date = pd.date_range('1980-01-01', '1980-01-02', freq='1H', tz='America/Phoenix')
sim = pd.DataFrame(np.random.rand(len(interim_date)), index=interim_date, columns=['Sim'])
obs = pd.DataFrame(np.random.rand(len(obs_date)), index=obs_date, columns=['Obs'])
print(pd.DataFrame.join(sim, obs).dropna())
it yields:
Sim Obs
1980-01-01 07:00:00+00:00 0.894766 0.509333
1980-01-01 08:00:00+00:00 0.805764 0.564251
1980-01-01 09:00:00+00:00 0.996807 0.856853
1980-01-01 10:00:00+00:00 0.494817 0.088286
1980-01-01 11:00:00+00:00 0.716468 0.947045
1980-01-01 12:00:00+00:00 0.808407 0.332764
1980-01-01 13:00:00+00:00 0.554688 0.959215
1980-01-01 14:00:00+00:00 0.389542 0.462384
1980-01-01 15:00:00+00:00 0.039566 0.850724
1980-01-01 16:00:00+00:00 0.634998 0.097579
1980-01-01 17:00:00+00:00 0.169957 0.390812
1980-01-01 18:00:00+00:00 0.113913 0.519487
1980-01-01 19:00:00+00:00 0.521354 0.260055
1980-01-01 20:00:00+00:00 0.910717 0.693063
1980-01-01 21:00:00+00:00 0.907878 0.190714
1980-01-01 22:00:00+00:00 0.625534 0.048584
1980-01-01 23:00:00+00:00 0.926966 0.815481
1980-01-02 00:00:00+00:00 0.841386 0.573255
It seems as though both methods should yield the same result, but it seems like in the first case the observed DataFrame index does not change, even though when I print it out it says that it changed... Any suggestions?
Your two methods of construction are actually different. For example, look at the first element of each index:
In [2]: obs_date1 = pd.date_range('1980-01-01', '1980-12-31', freq='1H') \
.tz_localize('UTC') \
.tz_convert('America/Phoenix')
In [3]: obs_date2 = pd.date_range('1980-01-01', '1980-12-31', freq='1H',
tz='America/Phoenix')
In [4]: obs_date1[0]
Out[4]: Timestamp('1979-12-31 17:00:00-0700', tz='America/Phoenix', freq='H')
In [5]: obs_date2[0]
Out[5]: Timestamp('1980-01-01 00:00:00-0700', tz='America/Phoenix', freq='H')
By doing .tz_localize('UTC')
, you're basically appending the UTC timezone to each timestamp. Then by using tz_convert('America/Phoenix')
you're shifting the UTC timestamps back seven hours to the America/Phoenix timezone. When you use this method and join
against the UTC timestamps, this index get shifted back to UTC for alignment, and both indexes exactly match, hence nothing getting dropped in your join
.
I think what you want for the first option is to just directly use tz_localize('America/Phoenix')
:
In [6]: obs_date3 = pd.date_range('1980-01-01', '1980-12-31', freq='1H') \
...: .tz_localize('America/Phoenix')
In [7]: obs_date3[0]
Out[7]: Timestamp('1980-01-01 00:00:00-0700', tz='America/Phoenix', freq='H')