Search code examples
pythonnumpyfor-loopvectorizationarray-broadcasting

Vectorizing taking longer than loop


My function that computes Lorentzian given freq, fwhm, amp. I want to vectorize it so that it does the computation for a list of freqs, fwhms and amps:

def lorz1(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series[:,None] - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor, axis=0))
    return np.sum(main_peak, axis=1)


def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)


def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor))
    return main_peak


series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)

Timing:

%timeit lorz1(series, freq, fwhm, amps)

38.4 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit lorz2(series, freq, fwhm, amps)

29.8 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.sum(np.array([lorz3(series, item1, item2, item3)
                         for (item1,item2,item3) in zip(freq, fwhm, amps)]), axis=0)

24.1 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Where am I going wrong with the vectorization in lorz1 and lorz2? Aren't they supposed to be faster than lorz3?


Solution

  • I did some further profiling using two versions:

    Version 1:

    #!/usr/bin/env python3
    
    import numpy as np
    
    
    def lorz2(freq_series, freq, fwhm, amp):
        numerator   = fwhm[:,None]
        denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
        lor         = numerator / denominator
        main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
        return np.sum(main_peak, axis=0)
    
    
    series = np.linspace(0,100,50000)
    freq   = np.random.uniform(5,50,50)
    fwhm   = np.random.uniform(0.01,0.05,50)
    amps   = np.random.uniform(5,500,50)
    
    for _ in range(100):
        lorz2(series, freq, fwhm, amps)
    

    and version 2:

    #!/usr/bin/env python3
    
    
    import numpy as np
    
    
    def lorz3(freq_series, freq, fwhm, amp):
        numerator   = fwhm
        denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
        lor         = numerator / denominator
        main_peak   = amp*(lor/np.linalg.norm(lor))
        return main_peak
    
    
    series = np.linspace(0,100,50000)
    freq   = np.random.uniform(5,50,50)
    fwhm   = np.random.uniform(0.01,0.05,50)
    amps   = np.random.uniform(5,500,50)
    
    
    for _ in range(100):
        sum(lorz3(series, item1, item2, item3)
            for (item1,item2,item3) in zip(freq, fwhm, amps))
    

    Notice how I tweaked the summation for lorz3 into a plain old Python sum. This is faster in my tests since it avoids the temporary array construction.

    Here are the results of some profiling I did:

    perf stat -ddd ./lorz2.py
    
     Performance counter stats for './lorz2.py':
    
               2729.16 msec task-clock:u                     #    1.000 CPUs utilized             
                     0      context-switches:u               #    0.000 /sec                      
                     0      cpu-migrations:u                 #    0.000 /sec                      
                217114      page-faults:u                    #   79.554 K/sec                     
            8192141440      cycles:u                         #    3.002 GHz                         (38.43%)
            3178961202      instructions:u                   #    0.39  insn per cycle              (46.12%)
             426575242      branches:u                       #  156.303 M/sec                       (53.81%)
               2177628      branch-misses:u                  #    0.51% of all branches             (61.51%)
           42020185035      slots:u                          #   15.397 G/sec                       (69.20%)
             323473974      topdown-retiring:u               #      0.6% Retiring                   (69.20%)
           33616148028      topdown-bad-spec:u               #     67.1% Bad Speculation            (69.20%)
             371211166      topdown-fe-bound:u               #      0.7% Frontend Bound             (69.20%)
           15767347418      topdown-be-bound:u               #     31.5% Backend Bound              (69.20%)
             813550722      L1-dcache-loads:u                #  298.096 M/sec                       (69.19%)
             546814255      L1-dcache-load-misses:u          #   67.21% of all L1-dcache accesses   (69.21%)
              82889242      LLC-loads:u                      #   30.372 M/sec                       (69.22%)
              67633317      LLC-load-misses:u                #   81.59% of all LL-cache accesses    (69.24%)
       <not supported>      L1-icache-loads:u                                                     
               9705348      L1-icache-load-misses:u          #    0.00% of all L1-icache accesses   (30.81%)
             864895659      dTLB-loads:u                     #  316.909 M/sec                       (30.79%)
                117310      dTLB-load-misses:u               #    0.01% of all dTLB cache accesses  (30.78%)
       <not supported>      iTLB-loads:u                                                          
                 85530      iTLB-load-misses:u               #    0.00% of all iTLB cache accesses  (30.76%)
       <not supported>      L1-dcache-prefetches:u                                                
       <not supported>      L1-dcache-prefetch-misses:u                                           
    
           2.729696014 seconds time elapsed
    
           1.932708000 seconds user
           0.796504000 seconds sys
    

    And here the faster version:

    
     Performance counter stats for './lorz3.py':
    
                878.49 msec task-clock:u                     #    0.999 CPUs utilized             
                     0      context-switches:u               #    0.000 /sec                      
                     0      cpu-migrations:u                 #    0.000 /sec                      
                 52869      page-faults:u                    #   60.182 K/sec                     
            3704170220      cycles:u                         #    4.217 GHz                         (38.22%)
            3735225800      instructions:u                   #    1.01  insn per cycle              (45.96%)
             568575253      branches:u                       #  647.221 M/sec                       (53.70%)
               2580294      branch-misses:u                  #    0.45% of all branches             (61.43%)
           18355798588      slots:u                          #   20.895 G/sec                       (69.17%)
            3328525030      topdown-retiring:u               #     17.5% Retiring                   (69.17%)
            6982401815      topdown-bad-spec:u               #     36.6% Bad Speculation            (69.17%)
            1297505291      topdown-fe-bound:u               #      6.8% Frontend Bound             (69.17%)
            7459691283      topdown-be-bound:u               #     39.1% Backend Bound              (69.17%)
             858082535      L1-dcache-loads:u                #  976.773 M/sec                       (69.28%)
             430569310      L1-dcache-load-misses:u          #   50.18% of all L1-dcache accesses   (69.40%)
              15723297      LLC-loads:u                      #   17.898 M/sec                       (69.49%)
                 73709      LLC-load-misses:u                #    0.47% of all LL-cache accesses    (69.50%)
       <not supported>      L1-icache-loads:u                                                     
              38705486      L1-icache-load-misses:u          #    0.00% of all L1-icache accesses   (30.72%)
             860276161      dTLB-loads:u                     #  979.270 M/sec                       (30.60%)
                 86213      dTLB-load-misses:u               #    0.01% of all dTLB cache accesses  (30.51%)
       <not supported>      iTLB-loads:u                                                          
                 91069      iTLB-load-misses:u               #    0.00% of all iTLB cache accesses  (30.50%)
       <not supported>      L1-dcache-prefetches:u                                                
       <not supported>      L1-dcache-prefetch-misses:u                                           
    
           0.878946776 seconds time elapsed
    
           0.852205000 seconds user
           0.026744000 seconds sys
    

    Notice how the number of instructions is actually slightly higher in the faster code, which makes sense since it is less vectorized, but the much higher instructions per cycle make it faster overall. There are twice as many LLC loads in the slower version, of which most miss while here almost all hit. I'm not sure how to interpret the topdown-bad-spec counter. Maybe someone else can comment on that.

    The CPU even clocks down (this is reproducible) which supports the idea that it is simply waiting on memory.

    Further, notice the sys time in the last line. lorz2 spends 28% of its runtime in kernel space. Since it doesn't do anything IO-related, that is all memory allocation and deallocation overhead.

    We can look a bit further at the stall reasons:

    perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz2.py
    
     Performance counter stats for './lorz2.py':
    
            8446540078      cycles:u                                                              
            1953955881      l1d_pend_miss.l2_stall:u                                              
            3050292324      cycle_activity.stalls_l3_miss:u                                       
    
           2.748141433 seconds time elapsed
    
           1.959570000 seconds user
           0.788443000 seconds sys
    
    perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz3.py
    
     Performance counter stats for './lorz3.py':
    
            3674547216      cycles:u                                                              
             303870088      l1d_pend_miss.l2_stall:u                                              
              16939496      cycle_activity.stalls_l3_miss:u                                       
    
           0.869909182 seconds time elapsed
    
           0.848122000 seconds user
           0.021752000 seconds sys
    

    So, the lorz2 version just stalls constantly on level 2 or 3 cache misses.

    We can further look at a simple perf report

    perf record ./lorz2.py
    perf report
    
    
    # Overhead  Command  Shared Object                                      Symbol                                         
    # ........  .......  .................................................  ...............................................
    #
        32.44%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_multiply
        27.03%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_divide
        17.34%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add
         6.93%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_subtract
         6.12%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_pairwise_sum
         5.32%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_square
         0.51%  python3  [unknown]                                          [k] 0xffffffffb2800fe7
         0.46%  python3  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
         0.19%  python3  libpython3.11.so.1.0                               [.] unicodekeys_lookup_unicode
         0.18%  python3  libpython3.11.so.1.0                               [.] gc_collect_main
    ...
    
    perf record ./lorz3.py
    perf report
    
    # Overhead  Command  Shared Object                                      Symbol                                       
    # ........  .......  .................................................  .............................................
    #
        27.56%  python3  libblas.so.3.11.0                                  [.] ddot_
        27.47%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_divide
         8.64%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_subtract
         8.34%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add
         5.84%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_multiply
         3.70%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_square
         1.47%  python3  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
         1.38%  python3  libgcc_s.so.1                                      [.] execute_cfa_program
         0.89%  python3  libgcc_s.so.1                                      [.] uw_update_context_1
         0.88%  python3  libgcc_s.so.1                                      [.] _Unwind_Find_FDE
         0.64%  python3  libgcc_s.so.1                                      [.] uw_frame_state_for
         0.61%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] ufunc_generic_fastcall
    ...
    

    Huh, that's interesting. Where does the dot product come from? I assume this is how linalg.norm is implemented for simple vectors.

    Incidentally, we can speed up the lorz2 and lorz3 versions slightly via 3 measures:

    1. Fold multiplication and summation into one matrix multiplication
    2. Reorder some operations to execute them on smaller arrays (or scalars)
    3. Replace divisions with multiplications by the inverse
    def lorz2a(freq_series, freq, fwhm, amp):
        numerator   = fwhm[:,None] * (0.5 / np.pi)
        denominator = (freq_series - freq[:,None])**2 + 0.25 * fwhm[:,None]**2
        lor         = numerator / denominator
        return (amp / np.linalg.norm(lor, axis=1)) @ lor
    
        
    def lorz3a(freq_series, freq, fwhm, amp):
        numerator   = fwhm * (0.5 / np.pi)
        denominator = (freq_series - freq)**2 + 0.25 * fwhm**2
        lor         = numerator / denominator
        main_peak   = amp / np.linalg.norm(lor) * lor
        return main_peak
    

    This does not change anything on the overall trends, however.

    In conclusion

    Numpy vectorization primarily helps reducing per-call overhead. Once the arrays are large enough, we don't get much benefit from it since the remaining interpreter overhead is small compared to the computations itself. Simultaneously, larger arrays result in reduced memory efficiency. Typically there is a sweet-spot somewhere around L2 or L3 cache size. The lorz3 implementation hits this spot better than the others.

    For a smaller series size and a larger size of the other arrays, we can expect lorz2 to perform better. For example this data set makes my lorz2a faster than my lorz3a:

    series = np.linspace(0,100,1000)
    freq   = np.random.uniform(5,50,2000)
    fwhm   = np.random.uniform(0.01,0.05,2000)
    amps   = np.random.uniform(5,500,2000)
    

    Numpy's simple, eager evaluation scheme puts the onus of tuning for this on the user. Other libraries like NumExpr try to avoid this.