I'm doing some number-crunching in a docker container. On an 8 cpu machine the dockerized execution is around 28x slower than the host? I've examined:
import numpy ; numpy.show_config()
shows identical result in the container and the hostos.cpu_count
reports the same in the container as the host.The program uses fastdtw which uses numpy internally. Here is a minimal driver:
#fdtw.py
# Runs in 1s on host
# Runs in 28s in Docker
import numpy as np
from fastdtw import fastdtw
import time
a = np.sin(np.arange(1000))
b = np.cos(np.arange(3000))
t = time.time()
for i in range(100):
fastdtw(a,b)
print(time.time()-t)
The numpy setup and test execution in Docker. A lot of the expensive calls are implement in cython (https://github.com/slaypni/fastdtw/tree/master/fastdtw). Some not being compiled and optimized in the Docker case? How?
FROM python:3.10-slim
RUN apt update
RUN apt-get -y install nano build-essential software-properties-common libpng-dev
RUN apt-get -y install libopenblas-dev libopenblas64-0-openmp
RUN apt-get -y install gfortran liblapack3 liblapack-dev
# libatlas-base-dev libatlas-base-dev
# libblas3 libblas-dev
RUN pip3 install numpy==1.22.4 fastdtw
COPY /server /
RUN python -c 'import numpy ; print(numpy.__version__)'
RUN python -c 'import numpy ; numpy.show_config()'
RUN python -m cProfile -s cumtime /server/fdtw.py > log.txt
RUN cat log.txt | head -500
RUN exit 1
The docker profile output:
#12 [ 8/36] RUN python -c 'import numpy ; print(numpy.__version__)'
#12 0.427 1.22.4
#12 DONE 0.5s
#13 [ 9/36] RUN python -c 'import numpy ; numpy.show_config()'
#13 0.611 openblas64__info:
#13 0.611 libraries = ['openblas64_', 'openblas64_']
#13 0.611 library_dirs = ['/usr/local/lib']
#13 0.611 language = c
#13 0.611 define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
#13 0.611 runtime_library_dirs = ['/usr/local/lib']
#13 0.611 blas_ilp64_opt_info:
#13 0.611 libraries = ['openblas64_', 'openblas64_']
#13 0.611 library_dirs = ['/usr/local/lib']
#13 0.611 language = c
#13 0.611 define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
#13 0.611 runtime_library_dirs = ['/usr/local/lib']
#13 0.611 openblas64__lapack_info:
#13 0.611 libraries = ['openblas64_', 'openblas64_']
#13 0.611 library_dirs = ['/usr/local/lib']
#13 0.611 language = c
#13 0.611 define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
#13 0.611 runtime_library_dirs = ['/usr/local/lib']
#13 0.611 lapack_ilp64_opt_info:
#13 0.611 libraries = ['openblas64_', 'openblas64_']
#13 0.611 library_dirs = ['/usr/local/lib']
#13 0.611 language = c
#13 0.611 define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
#13 0.611 runtime_library_dirs = ['/usr/local/lib']
#13 0.611 Supported SIMD extensions in this NumPy install:
#13 0.611 baseline = SSE,SSE2,SSE3
#13 0.611 found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
#13 0.611 not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
#13 DONE 0.7s
#14 [10/36] RUN python -m cProfile -s cumtime /server/fdtw.py > log.txt
#14 DONE 28.5s
#15 [11/36] RUN cat log.txt | head -500
#15 0.320 27.874674558639526
#15 0.320 46359783 function calls (46356877 primitive calls) in 28.046 seconds
#15 0.320
#15 0.320 Ordered by: cumulative time
#15 0.320
#15 0.320 ncalls tottime percall cumtime percall filename:lineno(function)
#15 0.320 112/1 0.000 0.000 28.046 28.046 {built-in method builtins.exec}
#15 0.320 1 0.010 0.010 28.046 28.046 fdtw.py:1(<module>)
#15 0.320 100 0.063 0.001 27.865 0.279 fastdtw.py:15(fastdtw)
#15 0.320 1000/100 0.899 0.001 27.800 0.278 fastdtw.py:64(__fastdtw)
#15 0.320 1000 9.697 0.010 18.775 0.019 fastdtw.py:133(__dtw)
#15 0.320 900 5.837 0.006 7.933 0.009 fastdtw.py:157(__expand_window)
#15 0.320 4349147 4.383 0.000 5.852 0.000 {built-in method builtins.min}
#15 0.320 4348100 1.212 0.000 1.711 0.000 fastdtw.py:56(__difference)
#15 0.320 13044300 1.469 0.000 1.469 0.000 fastdtw.py:143(<lambda>)
#15 0.320 4349100 1.176 0.000 1.176 0.000 fastdtw.py:137(<genexpr>)
#15 0.320 7089966 0.923 0.000 0.923 0.000 {method 'add' of 'set' objects}
#15 0.320 2991000 0.849 0.000 0.849 0.000 fastdtw.py:160(<genexpr>)
#15 0.320 4348196 0.499 0.000 0.499 0.000 {built-in method builtins.abs}
#15 0.320 14 0.001 0.000 0.373 0.027 __init__.py:1(<module>)
#15 0.320 4949989 0.372 0.000 0.372 0.000 {method 'append' of 'list' objects}
#15 0.320 798400 0.290 0.000 0.290 0.000 fastdtw.py:138(<lambda>)
#15 0.320 1800 0.003 0.000 0.191 0.000 fastdtw.py:153(__reduce_by_half)
#15 0.320 1800 0.188 0.000 0.188 0.000 fastdtw.py:154(<listcomp>)
... minor costs follow ...
The same tests on the host:
$ python -c 'import numpy ; print(numpy.__version__)'
1.22.4
$ python -c 'import numpy ; numpy.show_config()'
openblas64__info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
$ python -m cProfile -s cumtime fdtw.py > log.txt
$ cat log.txt | head -500
0.1275956630706787
90773 function calls (88792 primitive calls) in 0.286 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
14 0.001 0.000 0.336 0.024 __init__.py:1(<module>)
111/1 0.000 0.000 0.286 0.286 {built-in method builtins.exec}
1 0.006 0.006 0.286 0.286 fdtw.py:1(<module>)
151/2 0.001 0.000 0.159 0.079 <frozen importlib._bootstrap>:1022(_find_and_load)
151/2 0.001 0.000 0.159 0.079 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
139/2 0.001 0.000 0.158 0.079 <frozen importlib._bootstrap>:664(_load_unlocked)
110/2 0.000 0.000 0.158 0.079 <frozen importlib._bootstrap_external>:877(exec_module)
217/2 0.000 0.000 0.158 0.079 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
180/16 0.001 0.000 0.154 0.010 <frozen importlib._bootstrap>:1053(_handle_fromlist)
362/8 0.001 0.000 0.153 0.019 {built-in method builtins.__import__}
100 0.117 0.001 0.122 0.001 {fastdtw._fastdtw.fastdtw}
110 0.001 0.000 0.041 0.000 <frozen importlib._bootstrap_external>:950(get_code)
1 0.000 0.000 0.029 0.029 multiarray.py:1(<module>)
1 0.000 0.000 0.028 0.028 numeric.py:1(<module>)
1 0.000 0.000 0.028 0.028 overrides.py:1(<module>)
316 0.002 0.000 0.022 0.000 overrides.py:170(decorator)
110 0.000 0.000 0.021 0.000 <frozen importlib._bootstrap_external>:670(_compile_bytecode)
110 0.020 0.000 0.020 0.000 {built-in method marshal.loads}
139/136 0.000 0.000 0.019 0.000 <frozen importlib._bootstrap>:564(module_from_spec)
148 0.001 0.000 0.018 0.000 <frozen importlib._bootstrap>:921(_find_spec)
2 0.000 0.000 0.017 0.008 shape_base.py:1(<module>)
286 0.001 0.000 0.017 0.000 overrides.py:88(verify_matching_signatures)
1 0.000 0.000 0.016 0.016 py3k.py:1(<module>)
110 0.010 0.000 0.016 0.000 <frozen importlib._bootstrap_external>:1070(get_data)
136 0.000 0.000 0.015 0.000 <frozen importlib._bootstrap_external>:1431(find_spec)
136 0.001 0.000 0.015 0.000 <frozen importlib._bootstrap_external>:1399(_get_spec)
1 0.000 0.000 0.015 0.015 fromnumeric.py:1(<module>)
622 0.001 0.000 0.015 0.000 _inspect.py:96(getargspec)
17 0.000 0.000 0.014 0.001 <frozen importlib._bootstrap_external>:1174(create_module)
17 0.011 0.001 0.014 0.001 {built-in method _imp.create_dynamic}
289 0.003 0.000 0.013 0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
151 0.000 0.000 0.012 0.000 <frozen importlib._bootstrap>:169(__enter__)
454 0.001 0.000 0.012 0.000 <frozen importlib._bootstrap>:179(_get_module_lock)
622 0.011 0.000 0.011 0.000 _inspect.py:26(isfunction)
150 0.010 0.000 0.010 0.000 <frozen importlib._bootstrap>:71(__init__)
1 0.000 0.000 0.010 0.010 _add_newdocs_scalars.py:1(<module>)
1 0.000 0.000 0.010 0.010 _pickle.py:1(<module>)
150 0.000 0.000 0.009 0.000 re.py:288(_compile)
1 0.000 0.000 0.009 0.009 platform.py:1(<module>)
17/12 0.000 0.000 0.009 0.001 <frozen importlib._bootstrap_external>:1182(exec_module)
17/12 0.003 0.000 0.009 0.001 {built-in method _imp.exec_dynamic}
1 0.000 0.000 0.008 0.008 pathlib.py:1(<module>)
28 0.000 0.000 0.008 0.000 sre_compile.py:783(compile)
25 0.000 0.000 0.008 0.000 re.py:249(compile)
219/218 0.004 0.000 0.007 0.000 {built-in method builtins.__build_class__}
1 0.000 0.000 0.006 0.006 index_tricks.py:1(<module>)
1 0.000 0.000 0.005 0.005 _add_newdocs.py:1(<module>)
313 0.001 0.000 0.005 0.000 function_base.py:475(add_newdoc)
28 0.000 0.000 0.005 0.000 sre_parse.py:944(parse)
1501 0.002 0.000 0.004 0.000 <frozen importlib._bootstrap_external>:126(_path_join)
1 0.000 0.000 0.004 0.004 secrets.py:1(<module>)
139 0.001 0.000 0.004 0.000 <frozen importlib._bootstrap>:492(_init_module_attrs)
77/28 0.000 0.000 0.004 0.000 sre_parse.py:436(_parse_sub)
1 0.000 0.000 0.004 0.004 numerictypes.py:1(<module>)
82/30 0.002 0.000 0.004 0.000 sre_parse.py:494(_parse)
1 0.000 0.000 0.004 0.004 pickle.py:1(<module>)
1 0.000 0.000 0.004 0.004 subprocess.py:1(<module>)
596 0.000 0.000 0.004 0.000 <frozen importlib._bootstrap_external>:140(_path_stat)
2000 0.002 0.000 0.004 0.000 numerictypes.py:356(issubdtype)
1 0.000 0.000 0.003 0.003 ntpath.py:1(<module>)
596 0.003 0.000 0.003 0.000 {built-in method posix.stat}
28 0.000 0.000 0.003 0.000 sre_compile.py:622(_code)
1 0.000 0.000 0.003 0.003 version.py:1(<module>)
326 0.002 0.000 0.003 0.000 functools.py:35(update_wrapper)
220 0.001 0.000 0.003 0.000 <frozen importlib._bootstrap_external>:380(cache_from_source)
1 0.000 0.000 0.003 0.003 defmatrix.py:1(<module>)
1 0.000 0.000 0.003 0.003 defchararray.py:1(<module>)
303 0.001 0.000 0.003 0.000 <frozen importlib._bootstrap>:216(_lock_unlock_module)
2 0.000 0.000 0.003 0.001 _version.py:1(<module>)
1 0.000 0.000 0.003 0.003 hmac.py:1(<module>)
... minor costs follow
FastDTW contains both a fast path implemented in Cython, and a slow path implemented in Python. It tries the fast path first, and if it gets an ImportError, it falls back to the slow path. You are probably hitting the fast path in the host, and the slow path in the container.
Some evidence for this idea:
builtins.min()
appears in the slow profile. While both the fast path and slow path use min()
, Cython compiles this into C code which makes no use of the Python min()
. Therefore, the slower profile must be using the slow path in FastDTW.Source code where this feature is implemented:
You can use the following code to check this:
import inspect
import fastdtw
fastdtw.dtw
inspect.getfile(fastdtw.dtw)
Output, slow path:
Python 3.10.10 (main, Jun 25 2023, 11:16:46) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> import fastdtw
>>> fastdtw.dtw
<function dtw at 0x10f9441f0>
>>> inspect.getfile(fastdtw.dtw)
'/Users/nick/.pyenv/versions/3.10.10/lib/python3.10/site-packages/fastdtw/fastdtw.py'
Output, fast path:
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> import fastdtw
>>> fastdtw.dtw
<built-in function dtw>
>>> inspect.getfile(fastdtw.dtw)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.10/inspect.py", line 797, in getfile
raise TypeError('module, class, method, function, traceback, frame, or '
TypeError: module, class, method, function, traceback, frame, or code object was expected, got builtin_function_or_method
Timing of slow path: 16.39020 seconds
Timing of fast path: 0.09339 seconds
(Note: slow path and fast path benchmark are run on different hardware.)