
Performance of timezone-aware Pandas DateTimeIndex


I searched online but found nothing on the problem I'm facing.

It seems that pandas.DataFrame operations on an index of timezone-aware dates are orders of magnitude slower than on regular datetimes.

Here are the IPython timings.

First, with standard datetimes:

import pandas as pd
import numpy as np

dates=pd.date_range('2010/01/01 00:00:00', '2010/12/31 00:00:00', freq='1T')
DF=pd.DataFrame(data=np.random.rand(len(dates)), index=dates, columns=["value"])

# compute timedeltas between dates
%timeit DF["temp"] = DF.index
%timeit DF["deltas"] = (DF["temp"] - DF["temp"].shift())

The results are:

1000 loops, best of 3: 1.13 ms per loop
100 loops, best of 3: 17.1 ms per loop

So far, so good.

Now, just adding timezone information:

import pandas as pd
import numpy as np

dates=pd.date_range('2010/01/01 00:00:00', '2010/12/31 00:00:00', freq='1T')
# NEW: filter dates to avoid DST problems
dates=dates[dates.hour>2] # to avoid AmbiguousTimeError or NonExistentTimeError

DF=pd.DataFrame(data=np.random.rand(len(dates)), index=dates, columns=["value"])

# NEW: add timezone info
DF.index = DF.index.tz_localize(tz="America/New_York", ambiguous="infer")

# compute timedeltas between dates
%timeit DF["temp"] = DF.index
%timeit DF["deltas"] = (DF["temp"] - DF["temp"].shift())

And now the results are:

1 loops, best of 3: 5.43 s per loop
1 loops, best of 3: 16 s per loop

Why is that?
I really don't understand where the bottleneck is here.

For info (from conda list):

anaconda                  2.2.0                np19py34_0  
conda                     3.12.0                   py34_0  

numpy                     1.9.2                    py34_0  
pandas                    0.16.1               np19py34_0  
pytz                      2015.4                   py34_0  
scipy                     0.15.1               np19py34_0  

Solution

  • This is a known issue, see here. Series of tz-naive datetimes (i.e. with NO timezone) are represented efficiently with a dtype of datetime64[ns]. Calculations on them operate on the underlying int64 values and so are pretty fast. tz-aware Series are represented using object dtype, and calculations on them are quite a bit slower.

    It IS possible to fix this (see the referenced issue) by giving a Series with a single, uniform timezone an efficient representation. Pull requests are welcome!

    In [9]: df = pd.DataFrame({'datetime' : pd.date_range('20130101',periods=5), 'datetime_with_tz' : pd.date_range('20130101',periods=5,tz='US/Eastern')})
    
    In [10]: df 
    Out[10]: 
        datetime           datetime_with_tz
    0 2013-01-01  2013-01-01 00:00:00-05:00
    1 2013-01-02  2013-01-02 00:00:00-05:00
    2 2013-01-03  2013-01-03 00:00:00-05:00
    3 2013-01-04  2013-01-04 00:00:00-05:00
    4 2013-01-05  2013-01-05 00:00:00-05:00
    
    In [11]: df.dtypes
    Out[11]: 
    datetime            datetime64[ns]
    datetime_with_tz            object
    dtype: object
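
One possible workaround, given the explanation above, is to do the arithmetic on tz-naive UTC values so the data stays in a datetime64 dtype rather than object. The sketch below is only an illustration and is not from the referenced issue: the one-day index and the names `idx`, `naive_utc` and `temp` are made up for the example. Newer pandas releases (0.17.0 onward) store single-timezone columns with a dedicated datetime64[ns, tz] dtype, so this probably matters mainly on older versions such as the 0.16.1 used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical small example: one day of tz-aware timestamps at 1-minute frequency.
idx = pd.date_range("2010-01-01", periods=1440, freq="min", tz="America/New_York")
df = pd.DataFrame({"value": np.random.rand(len(idx))}, index=idx)

# Convert to UTC, then drop the timezone. The result is a tz-naive DatetimeIndex,
# and differences between UTC instants are unaffected by DST shifts.
naive_utc = idx.tz_convert("UTC").tz_localize(None)

# Compute the deltas on the naive values, keeping the tz-aware index for alignment.
temp = pd.Series(naive_utc, index=idx)
df["deltas"] = temp - temp.shift()

print(df.dtypes)
```

The DataFrame keeps its tz-aware index, and only the intermediate column used for the subtraction is tz-naive. Whether this is actually faster on a given pandas version is worth checking with %timeit, as in the question.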