Tags: python, datetime, timezone, utc, pytz

How can I speed up this UTC conversion process?


I would like to convert a range of naive datetimes, given in Europe/Berlin local time, to UTC. The following code takes more than three minutes for 500_000 entries.

How can I speed up this process?

import time

import pandas as pd
import pytz

# 500_000 naive timestamps, 5 minutes apart
abc = pd.date_range(start='2020-03-28 05:00:00', periods=500_000, freq='5min')
UTC = pytz.timezone('UTC')
BERLIN = pytz.timezone('Europe/Berlin')

print("abc[0]=\n", abc[0])
print("abc[-1]=\n", abc[-1])

myList = []
my_time = time.time()
for runner in abc:
    localizedToBerlin = BERLIN.localize(runner)
    localizedToBerlinAsUtc = localizedToBerlin.astimezone(UTC)
    myList.append([runner, localizedToBerlinAsUtc])
print('runtime:', time.time() - my_time)

results in:

abc[0]=
 2020-03-28 05:00:00
abc[-1]=
 2024-12-28 07:35:00
runtime: 209.57262253761292

Solution

  • pandas built-in - if you work with pandas, avoid Python-level loops and use the vectorized built-ins, e.g. tz_convert. For a range that is already localized to Europe/Berlin, converting to UTC:

    import pandas as pd
    dr = pd.date_range(start='2020-03-28 05:00:00', periods=500_000, freq='5min',
                       tz='Europe/Berlin')
    
    %timeit dr.tz_convert('UTC')
    77.2 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

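    Localizing naive timestamps to Europe/Berlin and then converting to UTC, which is the case in the question. Here nonexistent='NaT' and ambiguous='NaT' turn wall-clock times that fall into a DST gap or overlap into NaT instead of raising an error: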

    dr = pd.date_range(start='2020-03-28 05:00:00', periods=500_000, freq='5min')
    
    %timeit dr.tz_localize('Europe/Berlin', nonexistent='NaT', ambiguous='NaT').tz_convert('UTC')
    69.5 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
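
The loop from the question can be replaced by the same two calls. The following is a minimal sketch on a deliberately tiny range (chosen here for illustration, not taken from the question) that crosses the spring-forward gap in Berlin, so the effect of nonexistent='NaT' is visible. Note that this differs from pytz's localize(), which by default does not flag such times.

```python
import pandas as pd

# Small naive range that crosses the Berlin spring-forward gap:
# on 2020-03-29 the wall clock jumps from 02:00 straight to 03:00.
naive = pd.date_range(start="2020-03-29 01:30:00", periods=3, freq="60min")

# Vectorized counterpart of BERLIN.localize(...).astimezone(UTC).
# nonexistent="NaT" turns the 02:30 entry (which never occurred) into NaT.
as_utc = naive.tz_localize(
    "Europe/Berlin", nonexistent="NaT", ambiguous="NaT"
).tz_convert("UTC")

# Same pairing as myList in the question, as a two-column frame.
pairs = pd.DataFrame({"naive": naive, "utc": as_utc})
print(pairs)
```

If dropping such entries is not acceptable, tz_localize also accepts other values for nonexistent, such as 'shift_forward'.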
    

  • UTC first - also note that it is much faster to localize naive timestamps to UTC and then convert to another timezone, because UTC localization involves no computation of DST changes etc. This only fits if the naive values actually represent UTC.

    dr = pd.date_range(start='2020-03-28 05:00:00', periods=500_000, freq='5min')
    
    %timeit dr.tz_localize('UTC').tz_convert('Europe/Berlin')
    173 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
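
The two orders of operation are not interchangeable, because they interpret the naive input differently. A small sketch with a single, arbitrarily chosen summer timestamp illustrates this:

```python
import pandas as pd

naive = pd.DatetimeIndex(["2020-07-01 12:00:00"])

# Interpretation A: the naive value is already UTC; show it as Berlin time.
utc_first = naive.tz_localize("UTC").tz_convert("Europe/Berlin")

# Interpretation B: the naive value is Berlin wall-clock time.
berlin_first = naive.tz_localize("Europe/Berlin")

print(utc_first[0])
print(berlin_first[0])
# The two results are different instants (Berlin is UTC+2 in July).
print(utc_first[0] - berlin_first[0])
```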
    

  • Working with lists - if you're not working with pandas data structures or similar and have to use lists, localizing to UTC and then converting to another timezone still performs (relatively) OK:

    import pytz
    l = dr.to_list()
    
    l_utc = list(map(pytz.utc.localize, l))
    # %timeit list(map(pytz.utc.localize, l))
    # 1.44 s ± 7.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    cet = pytz.timezone('Europe/Berlin') # CET in winter, CEST in summer
    l_cet = list(map(lambda t: t.astimezone(cet), l_utc))
    # %timeit list(map(lambda t: t.astimezone(cet), l_utc))
    # 3.24 s ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Localizing naive datetimes directly to a non-UTC timezone is still very slow with pytz:

    %timeit list(map(cet.localize, l))
    2min 9s ± 7.31 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
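
If pandas is installed anyway, one possible workaround (not timed here, so treat it as a suggestion rather than a measured result) is to pass the list through a DatetimeIndex, localize it in one vectorized call, and convert back to plain datetime objects. The name naive_list below is a stand-in for the list l above.

```python
from datetime import datetime

import pandas as pd

# Stand-in for the list of naive datetimes
naive_list = [datetime(2020, 3, 28, 5, 0), datetime(2020, 7, 1, 12, 0)]

# Localize all entries at once instead of calling localize() per element
localized = pd.DatetimeIndex(naive_list).tz_localize(
    "Europe/Berlin", nonexistent="NaT", ambiguous="NaT"
)

# Back to a list of timezone-aware datetime.datetime objects
aware_list = list(localized.to_pydatetime())
print(aware_list)
```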
    

  • dateutil vs. pytz - an alternative here is dateutil: since it uses the same time zone model as Python's standard library datetime, you can attach the timezone with replace():

    import dateutil.tz  # import the submodule explicitly; plain "import dateutil" may not load it
    d_cet = dateutil.tz.gettz('Europe/Berlin')
    
    %timeit [t.replace(tzinfo=d_cet) for t in l]
    5.67 s ± 357 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
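
To get from there to UTC, as asked in the question, astimezone() can follow the replace() call. A minimal sketch with two sample values picked for illustration (one in winter time, one in summer time):

```python
from datetime import datetime, timezone

from dateutil import tz

berlin = tz.gettz("Europe/Berlin")

# Sample naive datetimes, interpreted as Berlin wall-clock time
naive = [datetime(2020, 3, 28, 5, 0), datetime(2020, 7, 1, 12, 0)]

# With dateutil, replace() attaches the zone; the offset is resolved per datetime
aware = [t.replace(tzinfo=berlin) for t in naive]

# Convert to UTC using the standard library's timezone.utc
as_utc = [t.astimezone(timezone.utc) for t in aware]
print(as_utc)
```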