Search code examples
pythondockernumpyoptimization

How to diagnose an 28x slowdown in containerized vs host python+numpy execution


I'm doing some number-crunching in a docker container. On an 8 cpu machine the dockerized execution is around 28x slower than the host? I've examined:

  • warm-up costs: I've tried running the test on the second execution in the same process (below the warmup costs appear negligble anyway)
  • numpy optimization options: import numpy ; numpy.show_config() shows identical result in the container and the host
  • cpus: os.cpu_count reports the same in the container as the host.
  • file IO: on the second run, there is no file IO. (The first run loads data, the second had the data cached in memory)
  • blas vs openblas installed
  • python version (both use python 3.10)
  • numpy version (1.22.4 for both)
  • fastdtw version (both same)

The program uses fastdtw which uses numpy internally. Here is a minimal driver:

#fdtw.py
# Runs in 1s on host
# Runs in 28s in Docker
import numpy as np
from fastdtw import fastdtw
import time

a = np.sin(np.arange(1000))
b = np.cos(np.arange(3000))

t = time.time()
for i in range(100):
    fastdtw(a,b)
print(time.time()-t)

The numpy setup and test execution in Docker. A lot of the expensive calls are implement in cython (https://github.com/slaypni/fastdtw/tree/master/fastdtw). Some not being compiled and optimized in the Docker case? How?

FROM python:3.10-slim
RUN apt update
RUN apt-get -y install nano build-essential software-properties-common libpng-dev
RUN apt-get -y install libopenblas-dev libopenblas64-0-openmp

RUN apt-get -y install gfortran liblapack3 liblapack-dev

#    libatlas-base-dev libatlas-base-dev
#     libblas3  libblas-dev

RUN pip3 install numpy==1.22.4 fastdtw

COPY /server /

RUN python -c 'import numpy ; print(numpy.__version__)'
RUN python -c 'import numpy ; numpy.show_config()'

RUN python -m cProfile  -s cumtime /server/fdtw.py > log.txt
RUN cat log.txt | head -500

RUN exit 1

The docker profile output:

#12 [ 8/36] RUN python -c 'import numpy ; print(numpy.__version__)'
#12 0.427 1.22.4
#12 DONE 0.5s

#13 [ 9/36] RUN python -c 'import numpy ; numpy.show_config()'
#13 0.611 openblas64__info:
#13 0.611     libraries = ['openblas64_', 'openblas64_']
#13 0.611     library_dirs = ['/usr/local/lib']
#13 0.611     language = c
#13 0.611     define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
#13 0.611     runtime_library_dirs = ['/usr/local/lib']
#13 0.611 blas_ilp64_opt_info:
#13 0.611     libraries = ['openblas64_', 'openblas64_']
#13 0.611     library_dirs = ['/usr/local/lib']
#13 0.611     language = c
#13 0.611     define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
#13 0.611     runtime_library_dirs = ['/usr/local/lib']
#13 0.611 openblas64__lapack_info:
#13 0.611     libraries = ['openblas64_', 'openblas64_']
#13 0.611     library_dirs = ['/usr/local/lib']
#13 0.611     language = c
#13 0.611     define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
#13 0.611     runtime_library_dirs = ['/usr/local/lib']
#13 0.611 lapack_ilp64_opt_info:
#13 0.611     libraries = ['openblas64_', 'openblas64_']
#13 0.611     library_dirs = ['/usr/local/lib']
#13 0.611     language = c
#13 0.611     define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
#13 0.611     runtime_library_dirs = ['/usr/local/lib']
#13 0.611 Supported SIMD extensions in this NumPy install:
#13 0.611     baseline = SSE,SSE2,SSE3
#13 0.611     found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
#13 0.611     not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
#13 DONE 0.7s

#14 [10/36] RUN python -m cProfile  -s cumtime /server/fdtw.py > log.txt
#14 DONE 28.5s



#15 [11/36] RUN cat log.txt | head -500
#15 0.320 27.874674558639526
#15 0.320          46359783 function calls (46356877 primitive calls) in 28.046 seconds
#15 0.320 
#15 0.320    Ordered by: cumulative time
#15 0.320 
#15 0.320    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#15 0.320     112/1    0.000    0.000   28.046   28.046 {built-in method builtins.exec}
#15 0.320         1    0.010    0.010   28.046   28.046 fdtw.py:1(<module>)
#15 0.320       100    0.063    0.001   27.865    0.279 fastdtw.py:15(fastdtw)
#15 0.320  1000/100    0.899    0.001   27.800    0.278 fastdtw.py:64(__fastdtw)
#15 0.320      1000    9.697    0.010   18.775    0.019 fastdtw.py:133(__dtw)
#15 0.320       900    5.837    0.006    7.933    0.009 fastdtw.py:157(__expand_window)
#15 0.320   4349147    4.383    0.000    5.852    0.000 {built-in method builtins.min}
#15 0.320   4348100    1.212    0.000    1.711    0.000 fastdtw.py:56(__difference)
#15 0.320  13044300    1.469    0.000    1.469    0.000 fastdtw.py:143(<lambda>)
#15 0.320   4349100    1.176    0.000    1.176    0.000 fastdtw.py:137(<genexpr>)
#15 0.320   7089966    0.923    0.000    0.923    0.000 {method 'add' of 'set' objects}
#15 0.320   2991000    0.849    0.000    0.849    0.000 fastdtw.py:160(<genexpr>)
#15 0.320   4348196    0.499    0.000    0.499    0.000 {built-in method builtins.abs}
#15 0.320        14    0.001    0.000    0.373    0.027 __init__.py:1(<module>)
#15 0.320   4949989    0.372    0.000    0.372    0.000 {method 'append' of 'list' objects}
#15 0.320    798400    0.290    0.000    0.290    0.000 fastdtw.py:138(<lambda>)
#15 0.320      1800    0.003    0.000    0.191    0.000 fastdtw.py:153(__reduce_by_half)
#15 0.320      1800    0.188    0.000    0.188    0.000 fastdtw.py:154(<listcomp>)
... minor costs follow ...

The same tests on the host:

$ python -c 'import numpy ; print(numpy.__version__)'
1.22.4
$ python -c 'import numpy ; numpy.show_config()'
openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL


$ python -m cProfile  -s cumtime fdtw.py > log.txt
$ cat log.txt | head -500
0.1275956630706787
         90773 function calls (88792 primitive calls) in 0.286 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       14    0.001    0.000    0.336    0.024 __init__.py:1(<module>)
    111/1    0.000    0.000    0.286    0.286 {built-in method builtins.exec}
        1    0.006    0.006    0.286    0.286 fdtw.py:1(<module>)
    151/2    0.001    0.000    0.159    0.079 <frozen importlib._bootstrap>:1022(_find_and_load)
    151/2    0.001    0.000    0.159    0.079 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
    139/2    0.001    0.000    0.158    0.079 <frozen importlib._bootstrap>:664(_load_unlocked)
    110/2    0.000    0.000    0.158    0.079 <frozen importlib._bootstrap_external>:877(exec_module)
    217/2    0.000    0.000    0.158    0.079 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
   180/16    0.001    0.000    0.154    0.010 <frozen importlib._bootstrap>:1053(_handle_fromlist)
    362/8    0.001    0.000    0.153    0.019 {built-in method builtins.__import__}
      100    0.117    0.001    0.122    0.001 {fastdtw._fastdtw.fastdtw}
      110    0.001    0.000    0.041    0.000 <frozen importlib._bootstrap_external>:950(get_code)
        1    0.000    0.000    0.029    0.029 multiarray.py:1(<module>)
        1    0.000    0.000    0.028    0.028 numeric.py:1(<module>)
        1    0.000    0.000    0.028    0.028 overrides.py:1(<module>)
      316    0.002    0.000    0.022    0.000 overrides.py:170(decorator)
      110    0.000    0.000    0.021    0.000 <frozen importlib._bootstrap_external>:670(_compile_bytecode)
      110    0.020    0.000    0.020    0.000 {built-in method marshal.loads}
  139/136    0.000    0.000    0.019    0.000 <frozen importlib._bootstrap>:564(module_from_spec)
      148    0.001    0.000    0.018    0.000 <frozen importlib._bootstrap>:921(_find_spec)
        2    0.000    0.000    0.017    0.008 shape_base.py:1(<module>)
      286    0.001    0.000    0.017    0.000 overrides.py:88(verify_matching_signatures)
        1    0.000    0.000    0.016    0.016 py3k.py:1(<module>)
      110    0.010    0.000    0.016    0.000 <frozen importlib._bootstrap_external>:1070(get_data)
      136    0.000    0.000    0.015    0.000 <frozen importlib._bootstrap_external>:1431(find_spec)
      136    0.001    0.000    0.015    0.000 <frozen importlib._bootstrap_external>:1399(_get_spec)
        1    0.000    0.000    0.015    0.015 fromnumeric.py:1(<module>)
      622    0.001    0.000    0.015    0.000 _inspect.py:96(getargspec)
       17    0.000    0.000    0.014    0.001 <frozen importlib._bootstrap_external>:1174(create_module)
       17    0.011    0.001    0.014    0.001 {built-in method _imp.create_dynamic}
      289    0.003    0.000    0.013    0.000 <frozen importlib._bootstrap_external>:1536(find_spec)
      151    0.000    0.000    0.012    0.000 <frozen importlib._bootstrap>:169(__enter__)
      454    0.001    0.000    0.012    0.000 <frozen importlib._bootstrap>:179(_get_module_lock)
      622    0.011    0.000    0.011    0.000 _inspect.py:26(isfunction)
      150    0.010    0.000    0.010    0.000 <frozen importlib._bootstrap>:71(__init__)
        1    0.000    0.000    0.010    0.010 _add_newdocs_scalars.py:1(<module>)
        1    0.000    0.000    0.010    0.010 _pickle.py:1(<module>)
      150    0.000    0.000    0.009    0.000 re.py:288(_compile)
        1    0.000    0.000    0.009    0.009 platform.py:1(<module>)
    17/12    0.000    0.000    0.009    0.001 <frozen importlib._bootstrap_external>:1182(exec_module)
    17/12    0.003    0.000    0.009    0.001 {built-in method _imp.exec_dynamic}
        1    0.000    0.000    0.008    0.008 pathlib.py:1(<module>)
       28    0.000    0.000    0.008    0.000 sre_compile.py:783(compile)
       25    0.000    0.000    0.008    0.000 re.py:249(compile)
  219/218    0.004    0.000    0.007    0.000 {built-in method builtins.__build_class__}
        1    0.000    0.000    0.006    0.006 index_tricks.py:1(<module>)
        1    0.000    0.000    0.005    0.005 _add_newdocs.py:1(<module>)
      313    0.001    0.000    0.005    0.000 function_base.py:475(add_newdoc)
       28    0.000    0.000    0.005    0.000 sre_parse.py:944(parse)
     1501    0.002    0.000    0.004    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
        1    0.000    0.000    0.004    0.004 secrets.py:1(<module>)
      139    0.001    0.000    0.004    0.000 <frozen importlib._bootstrap>:492(_init_module_attrs)
    77/28    0.000    0.000    0.004    0.000 sre_parse.py:436(_parse_sub)
        1    0.000    0.000    0.004    0.004 numerictypes.py:1(<module>)
    82/30    0.002    0.000    0.004    0.000 sre_parse.py:494(_parse)
        1    0.000    0.000    0.004    0.004 pickle.py:1(<module>)
        1    0.000    0.000    0.004    0.004 subprocess.py:1(<module>)
      596    0.000    0.000    0.004    0.000 <frozen importlib._bootstrap_external>:140(_path_stat)
     2000    0.002    0.000    0.004    0.000 numerictypes.py:356(issubdtype)
        1    0.000    0.000    0.003    0.003 ntpath.py:1(<module>)
      596    0.003    0.000    0.003    0.000 {built-in method posix.stat}
       28    0.000    0.000    0.003    0.000 sre_compile.py:622(_code)
        1    0.000    0.000    0.003    0.003 version.py:1(<module>)
      326    0.002    0.000    0.003    0.000 functools.py:35(update_wrapper)
      220    0.001    0.000    0.003    0.000 <frozen importlib._bootstrap_external>:380(cache_from_source)
        1    0.000    0.000    0.003    0.003 defmatrix.py:1(<module>)
        1    0.000    0.000    0.003    0.003 defchararray.py:1(<module>)
      303    0.001    0.000    0.003    0.000 <frozen importlib._bootstrap>:216(_lock_unlock_module)
        2    0.000    0.000    0.003    0.001 _version.py:1(<module>)
        1    0.000    0.000    0.003    0.003 hmac.py:1(<module>)
... minor costs follow


Solution

  • FastDTW contains both a fast path implemented in Cython, and a slow path implemented in Python. It tries the fast path first, and if it gets an ImportError, it falls back to the slow path. You are probably hitting the fast path in the host, and the slow path in the container.

    Some evidence for this idea:

    • The functions called in both are completely different. In the fast one, about a quarter of the functions in the trace are NumPy related. In the slow one, NumPy doesn't appear at all.
    • builtins.min() appears in the slow profile. While both the fast path and slow path use min(), Cython compiles this into C code which makes no use of the Python min(). Therefore, the slower profile must be using the slow path in FastDTW.

    Source code where this feature is implemented:

    • See here for the code which implements the fallback.
    • See here for the fast path.
    • See here for the slow path.

    You can use the following code to check this:

    import inspect
    import fastdtw
    fastdtw.dtw
    inspect.getfile(fastdtw.dtw)
    

    Output, slow path:

    Python 3.10.10 (main, Jun 25 2023, 11:16:46) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import inspect
    >>> import fastdtw
    >>> fastdtw.dtw
    <function dtw at 0x10f9441f0>
    >>> inspect.getfile(fastdtw.dtw)
    '/Users/nick/.pyenv/versions/3.10.10/lib/python3.10/site-packages/fastdtw/fastdtw.py'
    

    Output, fast path:

    Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import inspect
    >>> import fastdtw
    >>> fastdtw.dtw
    <built-in function dtw>
    >>> inspect.getfile(fastdtw.dtw)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.10/inspect.py", line 797, in getfile
        raise TypeError('module, class, method, function, traceback, frame, or '
    TypeError: module, class, method, function, traceback, frame, or code object was expected, got builtin_function_or_method
    

    Timing of slow path: 16.39020 seconds
    Timing of fast path: 0.09339 seconds

    (Note: slow path and fast path benchmark are run on different hardware.)