I have a question relating to the pow() function in Java 17's new (incubator) Vector API. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty matching the performance of the scalar implementation.
Here are some code snippets:
public static double[] createArray(int arrayLength)
{
    double[] array0 = new double[arrayLength];
    for (int i = 0; i < arrayLength; i++)
    {
        array0[i] = 2.0;
    }
    return array0;
}
@Param({"256000"})
int arraySize;
public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
DoubleVector vectorTwo = DoubleVector.broadcast(SPECIES,2);
DoubleVector vectorHundred = DoubleVector.broadcast(SPECIES,100);
double[] scalarTwo = new double[]{2,2,2,2};
double[] scalarHundred = new double[]{100,100,100,100};
@Setup
public void Setup()
{
    javaSIMD = new JavaSIMD();
    javaScalar = new JavaScalar();
    spotPrices = createArray(arraySize);
    timeToMaturity = createArray(arraySize);
    strikePrice = createArray(arraySize);
    interestRate = createArray(arraySize);
    volatility = createArray(arraySize);
    e = new double[arraySize];
    for (int i = 0; i < arraySize; i++)
    {
        e[i] = Math.exp(1);
    }
    upperBound = SPECIES.loopBound(spotPrices.length);
}
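(Side note on the e array above: Math.exp(1) writes the same constant, Math.E, into every slot, so the array and its per-iteration load can be replaced by a single hoisted broadcast. A minimal sketch of that idea, assuming a JDK with the jdk.incubator.vector module enabled; the class name is illustrative, not from the code above:)

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class BroadcastESketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Every lane holds Math.E, equivalent to loading a stretch of the
    // e[] array, but computed once with no array traffic per iteration.
    static final DoubleVector V_E = DoubleVector.broadcast(SPECIES, Math.E);

    public static void main(String[] args) {
        // Each lane of the broadcast vector holds the same constant.
        for (double d : V_E.toArray()) {
            System.out.println(d);
        }
    }
}
```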
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testVectorPerformance(Blackhole bh) {
    var upperBound = SPECIES.loopBound(spotPrices.length);
    for (var i = 0; i < upperBound; i += SPECIES.length())
    {
        bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices, timeToMaturity, strikePrice,
                interestRate, volatility, e, i));
    }
}
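(Aside: SPECIES.loopBound(...) rounds down to a multiple of the lane count, so any remainder elements are skipped, which happens to be harmless here since 256000 divides evenly for common species lengths. A hedged sketch of covering the tail with a mask, using illustrative names rather than the benchmark's own classes:)

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class TailLoopSketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Doubles every element of src into dst, including any tail that
    // loopBound() would leave behind.
    static void doubleAll(double[] src, double[] dst) {
        int i = 0;
        int upperBound = SPECIES.loopBound(src.length);
        for (; i < upperBound; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, src, i);
            v.mul(2.0).intoArray(dst, i);
        }
        // Tail: the mask disables the lanes that fall past src.length,
        // so the masked load/store never touch out-of-range elements.
        for (; i < src.length; i += SPECIES.length()) {
            VectorMask<Double> m = SPECIES.indexInRange(i, src.length);
            DoubleVector v = DoubleVector.fromArray(SPECIES, src, i, m);
            v.mul(2.0).intoArray(dst, i, m);
        }
    }

    public static void main(String[] args) {
        // Deliberately not a multiple of the lane count.
        double[] src = new double[SPECIES.length() * 2 + 3];
        for (int j = 0; j < src.length; j++) src[j] = j;
        double[] dst = new double[src.length];
        doubleAll(src, dst);
        System.out.println(dst[src.length - 1]); // tail element was processed
    }
}
```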
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testScalarPerformance(Blackhole bh) {
    for (int i = 0; i < arraySize; i++)
    {
        bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices, timeToMaturity, strikePrice,
                interestRate, volatility, i, normDist));
    }
}
public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
        double[] interestRate, double[] volatility, double[] e, int i) {
    ...(skip lines)
    DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
    ...(skip lines)
    DoubleVector powerOperand = vRateScaled
            .mul(vTime)
            .neg();
    DoubleVector call = (vSpot
            .mul(CDFVectorizedExcelOptimized(d1, vE)))
            .sub(vStrike
                    .mul(vE
                            .pow(powerOperand))
                    .mul(CDFVectorizedExcelOptimized(d2, vE)));
    return call;
}
Here are some JMH results (2 forks, 2 warmup iterations, 2 measurement iterations) on a Ryzen 5800X under WSL. Overall, the vector version is ~2x slower than the scalar version. I also timed the method separately with a simple before/after measurement, without JMH, and the result is in line with these numbers.
Result "blackScholes.TestJavaPerf.testScalarPerformance":
0.116 ±(99.9%) 0.002 ops/ms [Average]
89873915287 cycles:u # 4.238 GHz (40.43%)
242060738532 instructions:u # 2.69 insn per cycle
Result "blackScholes.TestJavaPerf.testVectorPerformance":
0.071 ±(99.9%) 0.001 ops/ms [Average]
90878787665 cycles:u # 4.072 GHz (39.25%)
254117779312 instructions:u # 2.80 insn per cycle
I also enabled diagnostic options on the JVM:
"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics", "-XX:+PrintAssembly"
With these, I see the following:
0x00007fe451959413: call 0x00007fe451239f00 ; ImmutableOopMap {rsi=Oop }
;*synchronization entry
; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
; {runtime_call counter_overflow Runtime1 stub}
0x00007fe451959418: jmp 0x00007fe4519593ce
0x00007fe45195941a: movabs $0x7fe4519593ee,%r10 ; {internal_word}
0x00007fe451959424: mov %r10,0x358(%r15)
0x00007fe45195942b: jmp 0x00007fe451193100 ; {runtime_call SafepointBlob}
0x00007fe451959430: nop
0x00007fe451959431: nop
0x00007fe451959432: mov 0x3d0(%r15),%rax
0x00007fe451959439: movq $0x0,0x3d0(%r15)
0x00007fe451959444: movq $0x0,0x3d8(%r15)
0x00007fe45195944f: add $0x40,%rsp
0x00007fe451959453: pop %rbp
0x00007fe451959454: jmp 0x00007fe451231e80 ; {runtime_call unwind_exception Runtime1 stub}
0x00007fe451959459: hlt
<More halts cut off>
[Exception Handler]
0x00007fe451959460: call 0x00007fe451234580 ; {no_reloc}
0x00007fe451959465: movabs $0x7fe46e76df9a,%rdi ; {external_word}
0x00007fe45195946f: and $0xfffffffffffffff0,%rsp
0x00007fe451959473: call 0x00007fe46e283d40 ; {runtime_call}
0x00007fe451959478: hlt
[Deopt Handler Code]
0x00007fe451959479: movabs $0x7fe451959479,%r10 ; {section_word}
0x00007fe451959483: push %r10
0x00007fe451959485: jmp 0x00007fe4511923a0 ; {runtime_call DeoptimizationBlob}
0x00007fe45195948a: hlt
<More halts cut off>
--------------------------------------------------------------------------------
============================= C2-compiled nmethod ==============================
** svml call failed for double_pow_32
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 2 java.lang.Math::pow (6 bytes) (intrinsic)
Investigations/Questions:
Note: I believe it is using 256-bit wide vectors (checked during debugging).
This might be related to JDK-8262275, "Math vector stubs are not called for double64 vectors":

"For Double64Vector, the svml math vector stubs intrinsification is failing and they are not being called from jitted code."

But we do have svml double64 vectors.
You might try alternative operations: instead of vE.pow(powerOperand), with vE being a vector of e, you can use powerOperand.lanewise(VectorOperators.EXP) to compute e^x for all lanes. Keep in mind that this API is a work in progress in its incubator state…
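A minimal, self-contained sketch of that substitution (assuming JDK 17+ with the jdk.incubator.vector module enabled; the class and method names are illustrative, not from the poster's code):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ExpLanewiseSketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Original form: broadcast e into a vector, then raise it with pow().
    static DoubleVector viaPow(DoubleVector p) {
        return DoubleVector.broadcast(SPECIES, Math.E).pow(p);
    }

    // Suggested form: one lanewise EXP, no pow() and no e-vector needed.
    static DoubleVector viaExp(DoubleVector p) {
        return p.lanewise(VectorOperators.EXP);
    }

    public static void main(String[] args) {
        // Exponents shaped like the kernel's -r * t values.
        double[] exponents = new double[SPECIES.length()];
        for (int i = 0; i < exponents.length; i++) {
            exponents[i] = -0.02 * (i + 1);
        }
        DoubleVector p = DoubleVector.fromArray(SPECIES, exponents, 0);
        double[] a = viaPow(p).toArray();
        double[] b = viaExp(p).toArray();
        for (int i = 0; i < a.length; i++) {
            System.out.printf("pow: %.10f  exp: %.10f%n", a[i], b[i]);
        }
    }
}
```

The two forms agree to within floating-point tolerance; whether EXP hits the SVML stubs on a given JDK build and CPU still needs to be verified with -XX:+PrintIntrinsics.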