I'm trying to get the Unix timestamp from millions of bytes objects.
Using this:
import datetime
dt_bytes = b'2019-05-23 09:37:56.362965'
#fmt = '%m/%d/%Y %H:%M:%S.%f'
fmt = '%Y-%m-%d %H:%M:%S.%f'
dt_ts = datetime.datetime.strptime(dt_bytes.decode('utf-8'), fmt)
unix_ts = dt_ts.timestamp()
works perfectly:
In [82]: unix_ts
Out[82]: 1558604276.362965
But the decode('utf-8') call is cutting the throughput roughly in half (from 38k/sec to 20k/sec).
So is there a way to get the Unix timestamp from a bytes input instead of a str input?
__UPDATE:__
I found out that the bottleneck is datetime.datetime.strptime(..), so I switched to np.datetime64 (see below).
__UPDATE 2:__ Check the accepted answer below to get a good performance benchmark of different approaches.
Let's first assume you have strings in ISO format with a space separator, '%Y-%m-%d %H:%M:%S.%f', in a list (let's also not consider decoding from a byte array for now):
from datetime import datetime, timedelta
base, n = datetime(2000, 1, 1, 1, 2, 3, 420001), 1000
datelist = [(base + timedelta(days=i)).isoformat(' ') for i in range(n)]
# datelist
# ['2000-01-01 01:02:03.420001'
# ...
# '2002-09-26 01:02:03.420001']
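Decoding is set aside here, but if the input does arrive as bytes objects, one thing that might be worth trying is to decode the whole batch in one call instead of once per element. The following is only a sketch under assumptions: `bytes_list`, `decode_each` and `decode_joined` are hypothetical names, the timestamps are assumed to be pure ASCII without newlines, and whether the joined variant is actually faster on your data has to be measured.

```python
from datetime import datetime, timedelta

# Hypothetical input: the same kind of timestamps as above, but as bytes objects.
base = datetime(2000, 1, 1, 1, 2, 3, 420001)
bytes_list = [(base + timedelta(days=i)).isoformat(' ').encode('ascii')
              for i in range(1000)]

def decode_each(items):
    # Baseline: one decode call per element.
    return [b.decode('ascii') for b in items]

def decode_joined(items):
    # Join with a separator that cannot occur in the timestamps,
    # decode once, then split back into individual strings.
    # Assumes a non-empty list (an empty list would yield ['']).
    return b'\n'.join(items).decode('ascii').split('\n')

str_list = decode_joined(bytes_list)
```

Both functions return the same list of str, so either result can be passed to the parsing functions below.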
from string to datetime object
Let's define some functions that parse a string to a datetime object, using different methods:
import re
import numpy as np

def strp_isostr(l):
    # built-in ISO parser (Python 3.7+)
    return list(map(datetime.fromisoformat, l))

def isostr_to_nparr(l):
    # let NumPy parse the strings into a datetime64 array
    return np.array(l, dtype=np.datetime64)

def split_isostr(l):
    # split date and time step by step, then convert all fields to int
    def splitter(s):
        tmp = s.split(' ')
        tmp = tmp[0].split('-') + [tmp[1]]
        tmp = tmp[:3] + tmp[3].split(':')
        tmp = tmp[:5] + tmp[5].split('.')
        return datetime(*map(int, tmp))
    return list(map(splitter, l))

def resplit_isostr(l):
    # split on all separators at once with a regex (raw string avoids invalid escapes)
    # return list(map(lambda s: datetime(*map(int, re.split(r'[T\-:.]', s))), l))
    return [datetime(*map(int, re.split(r'[ \-:.]', s))) for s in l]

def full_stptime(l):
    # parse with an explicit format string
    # return list(map(lambda s: datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%f'), l))
    return [datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f') for s in l]
If I run %timeit in the IPython console for these functions on my machine, I get
%timeit strp_isostr(datelist)
98.2 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit isostr_to_nparr(datelist)
1.49 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit split_isostr(datelist)
3.02 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit resplit_isostr(datelist)
3.8 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit full_stptime(datelist)
16.7 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So we can conclude that the built-in datetime.fromisoformat (available since Python 3.7) is by far the fastest option for the 1000-element input. However, this assumes you want a list of datetime objects to work with. In case you need an np.array of datetime64 anyway, going straight to that seems like the best option.
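If you go the datetime64 route and still need seconds since the epoch, a conversion along the following lines should work. This is a minimal sketch, not a benchmarked recommendation: the names `strings`, `arr` and `unix_ts` are only illustrative, and since datetime64 carries no time zone, the result corresponds to reading the strings as UTC.

```python
import numpy as np

strings = ['2019-05-23 09:37:56.362965', '2000-01-01 01:02:03.420001']

# Parse to microsecond resolution so the fractional part is kept.
arr = np.array(strings, dtype='datetime64[us]')

# Integer microseconds since 1970-01-01T00:00:00, scaled to float seconds.
unix_ts = arr.astype('int64') / 1e6
```

The result is a float64 array, so the conversion happens for the whole array at once rather than element by element.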
third party option: ciso8601
If you're able to install additional packages, ciso8601 is worth a look:
import ciso8601

def ciso(l):
    return list(map(ciso8601.parse_datetime, l))
%timeit ciso(datelist)
138 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
from datetime object to seconds since the epoch
Looking at the conversion from a datetime object to a POSIX timestamp, the most obvious method, datetime.timestamp, also seems to be the most efficient:
import time

def dt_ts(l):
    return list(map(datetime.timestamp, l))

def timetup(l):
    # note: timetuple() carries no microseconds, so this yields whole seconds only
    return list(map(time.mktime, map(datetime.timetuple, l)))
%timeit dt_ts(strp_isostr(datelist))
572 µs ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit timetup(strp_isostr(datelist))
1.44 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
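One caveat: for a naive datetime, datetime.timestamp interprets the value as local time, so the result depends on the machine's time zone setting. If the input is known to be UTC, the two steps can be combined as in the sketch below. The helper name `bytes_to_unix_ts` is hypothetical, and the assumption that the bytes hold ASCII timestamps in UTC has to be checked against your data.

```python
from datetime import datetime, timezone

def bytes_to_unix_ts(raw):
    # Assumption: raw holds an ASCII timestamp in UTC,
    # formatted like b'2019-05-23 09:37:56.362965'.
    dt = datetime.fromisoformat(raw.decode('ascii'))
    # Attach UTC explicitly so the result does not depend on the local time zone.
    return dt.replace(tzinfo=timezone.utc).timestamp()

unix_ts = bytes_to_unix_ts(b'2019-05-23 09:37:56.362965')
```

On a machine whose local time zone is UTC, this gives the same value as calling timestamp() on the naive datetime.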