
Efficiently localize an array of datetimes with pytz


What is the most efficient way of converting an array of naive datetime.datetime objects to an array of timezone-aware datetime objects?

Currently I have them in a numpy array. The answer doesn't necessarily need to end up as a numpy array, but it should assume the data starts as one.

e.g. if I have this:

import numpy as np
import pytz
from datetime import datetime

# Time zone information
timezone = pytz.FixedOffset(-480)

# Numpy array of datetime objects
datetimes = np.array([datetime(2022, 1, 1, 12, 0, 0), datetime(2022, 1, 2, 12, 0, 0)])

How can I make datetimes timezone-aware?

Obviously a list comprehension would work, but for large arrays it doesn't seem as efficient as it could be. I would like a vectorized operation.

ChatGPT told me this would work (spoiler alert: it doesn't):

# Add time zone information to each datetime object
datetimes_with_timezone = timezone.localize(datetimes, is_dst=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Nick\Anaconda3\envs\pftools\lib\site-packages\pytz\tzinfo.py", line 317, in localize
    if dt.tzinfo is not None:
AttributeError: 'numpy.ndarray' object has no attribute 'tzinfo'
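
Presumably this fails because pytz's localize only accepts a single datetime.datetime, not a whole array. For reference, applying it per element (continuing from the snippet above) does work; here is a minimal sketch, which is exactly the kind of loop I would like to avoid:

# localize accepts one naive datetime at a time, so a per-element
# loop works; it just isn't vectorized
datetimes_with_timezone = np.array([timezone.localize(dt) for dt in datetimes])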

Solution

  • If you plan to work with pandas anyway further down the line, it might bring some benefits here as well. Here are a few options, with %timeit results for relative comparison:

    import numpy as np
    import pandas as pd
    from datetime import datetime, timezone, timedelta
    
    # Time zone information
    tz = timezone(timedelta(minutes=-480))
    
    # Numpy array of datetime objects; let's make it one day at second resolution
    dt_array = np.array([datetime(2022, 1, 1) + timedelta(seconds=i) for i in range(86400)])
    
    # convert array to Series, then set tz:
    dt_series = pd.Series(dt_array).dt.tz_localize(tz)
    # %timeit pd.Series(dt_array).dt.tz_localize(tz)
    # 12.1 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # you can also get a numpy array back:
    dt_array_aware = pd.Series(dt_array).dt.tz_localize(tz).to_numpy()
    # %timeit pd.Series(dt_array).dt.tz_localize(tz).to_numpy()
    # 188 ms ± 717 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # good old list comp:
    dt_list = [d.replace(tzinfo=tz) for d in dt_array]
    # %timeit [d.replace(tzinfo=tz) for d in dt_array]
    # 93.3 ms ± 2.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # might also be put into a numpy array:
    dt_array_tz = np.array([d.replace(tzinfo=tz) for d in dt_array])
    # %timeit np.array([d.replace(tzinfo=tz) for d in dt_array])
    # 212 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
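
    A closely related variant (a sketch, not re-timed against the numbers above): a pandas DatetimeIndex can be localized directly, and its to_pydatetime() returns a numpy object array of tz-aware datetime.datetime objects rather than pandas Timestamps:

    # DatetimeIndex route; tz_localize and to_pydatetime are standard
    # pandas methods, but the relative speed here is untested
    dt_index = pd.DatetimeIndex(dt_array).tz_localize(tz)
    dt_objs = dt_index.to_pydatetime()  # ndarray of aware datetime objects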
    

    Besides, pytz's localize is a tiny bit slower (and note that pytz is deprecated anyway):

    import pytz
    tz = pytz.FixedOffset(-480)
    
    dt_list = [tz.localize(d) for d in dt_array]
    # %timeit [tz.localize(d) for d in dt_array]
    # 105 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
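
    And since pytz is being phased out, here is roughly the same thing with the standard-library zoneinfo module (Python 3.9+); just a sketch, not timed. With zoneinfo, a plain replace(tzinfo=...) is the correct way to attach a zone, no localize step needed:

    from zoneinfo import ZoneInfo

    # assumption: a fixed UTC-8 offset is what's wanted; "Etc/GMT+8" is the
    # IANA name for UTC-08:00 (the sign is inverted by convention).
    # On Windows this may need the third-party tzdata package installed.
    zone = ZoneInfo("Etc/GMT+8")
    dt_list_zi = [d.replace(tzinfo=zone) for d in dt_array]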