
Efficiently localize an array of datetimes with pytz


What is the most efficient way of converting an array of naive datetime.datetime objects to an array of timezone-aware datetime objects?

Currently I have them in a numpy array. The answer doesn't necessarily need to end up as a numpy array, but it should assume the data starts as one.

e.g. if I have this:

import numpy as np
import pytz
from datetime import datetime

# Time zone information
timezone = pytz.FixedOffset(-480)

# Numpy array of datetime objects
datetimes = np.array([datetime(2022, 1, 1, 12, 0, 0), datetime(2022, 1, 2, 12, 0, 0)])

How can I make datetimes timezone-aware?

Obviously a list comprehension would work, but for large arrays it doesn't seem as efficient as it could be. I would like a vectorized operation.

ChatGPT told me this would work (spoiler alert: it doesn't):

# Add time zone information to each datetime object
datetimes_with_timezone = timezone.localize(datetimes, is_dst=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Nick\Anaconda3\envs\pftools\lib\site-packages\pytz\tzinfo.py", line 317, in localize
    if dt.tzinfo is not None:
AttributeError: 'numpy.ndarray' object has no attribute 'tzinfo'
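
Presumably this fails because pytz's localize only accepts a single datetime.datetime, not a whole array. For reference, applying it per element (continuing from the snippet above) does work; here is a minimal sketch, which is exactly the kind of loop I would like to avoid:

# localize accepts one naive datetime at a time, so a per-element
# loop works; it just isn't vectorized
datetimes_with_timezone = np.array([timezone.localize(dt) for dt in datetimes])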

Solution

  • If you plan to work with pandas anyway further down the line, it might bring some benefits here as well. Here are a few options, with %timeit results for relative comparison:

    import numpy as np
    import pandas as pd
    from datetime import datetime, timezone, timedelta
    
    # Time zone information
    tz = timezone(timedelta(minutes=-480))
    
    # Numpy array of datetime objects; let's make it one day at second resolution
    dt_array = np.array([datetime(2022, 1, 1) + timedelta(seconds=i) for i in range(86400)])
    
    # convert array to Series, then set tz:
    dt_series = pd.Series(dt_array).dt.tz_localize(tz)
    # %timeit pd.Series(dt_array).dt.tz_localize(tz)
    # 12.1 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # you can also get a numpy array back:
    dt_array_aware = pd.Series(dt_array).dt.tz_localize(tz).to_numpy()
    # %timeit pd.Series(dt_array).dt.tz_localize(tz).to_numpy()
    # 188 ms ± 717 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # good old list comp:
    dt_list = [d.replace(tzinfo=tz) for d in dt_array]
    # %timeit [d.replace(tzinfo=tz) for d in dt_array]
    # 93.3 ms ± 2.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # might also be put into a numpy array:
    dt_array_tz = np.array([d.replace(tzinfo=tz) for d in dt_array])
    # %timeit np.array([d.replace(tzinfo=tz) for d in dt_array])
    # 212 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
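
    A closely related variant (a sketch, not re-timed against the numbers above): a pandas DatetimeIndex can be localized directly, and its to_pydatetime() returns a numpy object array of tz-aware datetime.datetime objects rather than pandas Timestamps:

    # DatetimeIndex route; tz_localize and to_pydatetime are standard
    # pandas methods, but the relative speed here is untested
    dt_index = pd.DatetimeIndex(dt_array).tz_localize(tz)
    dt_objs = dt_index.to_pydatetime()  # ndarray of aware datetime objects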
    

    Besides, pytz's localize is a tiny bit slower (and note that pytz is deprecated anyway):

    import pytz
    tz = pytz.FixedOffset(-480)
    
    dt_list = [tz.localize(d) for d in dt_array]
    # %timeit [tz.localize(d) for d in dt_array]
    # 105 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
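
    And since pytz is being phased out, here is roughly the same thing with the standard-library zoneinfo module (Python 3.9+); just a sketch, not timed. With zoneinfo, a plain replace(tzinfo=...) is the correct way to attach a zone, no localize step needed:

    from zoneinfo import ZoneInfo

    # assumption: a fixed UTC-8 offset is what's wanted; "Etc/GMT+8" is the
    # IANA name for UTC-08:00 (the sign is inverted by convention).
    # On Windows this may need the third-party tzdata package installed.
    zone = ZoneInfo("Etc/GMT+8")
    dt_list_zi = [d.replace(tzinfo=zone) for d in dt_array]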