I have a question relating to the pow() function in Java 17's new (incubator) Vector API. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty matching the performance of the scalar implementation.
Here are some code snippets:
public static double[] createArray(int arrayLength)
{
    double[] array0 = new double[arrayLength];
    for (int i = 0; i < arrayLength; i++)
    {
        array0[i] = 2.0;
    }
    return array0;
}
@Param({"256000"})
int arraySize;
public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
DoubleVector vectorTwo = DoubleVector.broadcast(SPECIES,2);
DoubleVector vectorHundred = DoubleVector.broadcast(SPECIES,100);
double[] scalarTwo = new double[]{2,2,2,2};
double[] scalarHundred = new double[]{100,100,100,100};
@Setup
public void Setup()
{
    javaSIMD = new JavaSIMD();
    javaScalar = new JavaScalar();
    spotPrices = createArray(arraySize);
    timeToMaturity = createArray(arraySize);
    strikePrice = createArray(arraySize);
    interestRate = createArray(arraySize);
    volatility = createArray(arraySize);
    e = new double[arraySize];
    for (int i = 0; i < arraySize; i++)
    {
        e[i] = Math.exp(1);
    }
    upperBound = SPECIES.loopBound(spotPrices.length);
}
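(Side note on the e array above: Math.exp(1) writes the same constant, Math.E, into every slot, so the array and its per-iteration load can be replaced by a single hoisted broadcast. A minimal sketch of that idea, assuming a JDK with the jdk.incubator.vector module enabled; the class name is illustrative, not from the code above:)

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class BroadcastESketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Every lane holds Math.E, equivalent to loading a stretch of the
    // e[] array, but computed once with no array traffic per iteration.
    static final DoubleVector V_E = DoubleVector.broadcast(SPECIES, Math.E);

    public static void main(String[] args) {
        // Each lane of the broadcast vector holds the same constant.
        for (double d : V_E.toArray()) {
            System.out.println(d);
        }
    }
}
```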
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testVectorPerformance(Blackhole bh) {
    var upperBound = SPECIES.loopBound(spotPrices.length);
    for (var i = 0; i < upperBound; i += SPECIES.length())
    {
        bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices, timeToMaturity, strikePrice,
                interestRate, volatility, e, i));
    }
}
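(Aside: SPECIES.loopBound(...) rounds down to a multiple of the lane count, so any remainder elements are skipped, which happens to be harmless here since 256000 divides evenly for common species lengths. A hedged sketch of covering the tail with a mask, using illustrative names rather than the benchmark's own classes:)

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class TailLoopSketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Doubles every element of src into dst, including any tail that
    // loopBound() would leave behind.
    static void doubleAll(double[] src, double[] dst) {
        int i = 0;
        int upperBound = SPECIES.loopBound(src.length);
        for (; i < upperBound; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, src, i);
            v.mul(2.0).intoArray(dst, i);
        }
        // Tail: the mask disables the lanes that fall past src.length,
        // so the masked load/store never touch out-of-range elements.
        for (; i < src.length; i += SPECIES.length()) {
            VectorMask<Double> m = SPECIES.indexInRange(i, src.length);
            DoubleVector v = DoubleVector.fromArray(SPECIES, src, i, m);
            v.mul(2.0).intoArray(dst, i, m);
        }
    }

    public static void main(String[] args) {
        // Deliberately not a multiple of the lane count.
        double[] src = new double[SPECIES.length() * 2 + 3];
        for (int j = 0; j < src.length; j++) src[j] = j;
        double[] dst = new double[src.length];
        doubleAll(src, dst);
        System.out.println(dst[src.length - 1]); // tail element was processed
    }
}
```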
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testScalarPerformance(Blackhole bh) {
    for (int i = 0; i < arraySize; i++)
    {
        bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices, timeToMaturity, strikePrice,
                interestRate, volatility, i, normDist));
    }
}
public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
        double[] interestRate, double[] volatility, double[] e, int i) {
    ...(skip lines)
    DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
    ...(skip lines)
    DoubleVector powerOperand = vRateScaled
            .mul(vTime)
            .neg();
    DoubleVector call = (vSpot
            .mul(CDFVectorizedExcelOptimized(d1, vE)))
            .sub(vStrike
                    .mul(vE
                            .pow(powerOperand))
                    .mul(CDFVectorizedExcelOptimized(d2, vE)));
    return call;
}
Here are some JMH results (2 forks, 2 warmup iterations, 2 measurement iterations) on a Ryzen 5800X under WSL. Overall, the vector version is ~2x slower than the scalar version. I also timed the method separately with a simple before/after measurement, without JMH, and the result is in line with these numbers.
Result "blackScholes.TestJavaPerf.testScalarPerformance":
0.116 ±(99.9%) 0.002 ops/ms [Average]
89873915287 cycles:u # 4.238 GHz (40.43%)
242060738532 instructions:u # 2.69 insn per cycle
Result "blackScholes.TestJavaPerf.testVectorPerformance":
0.071 ±(99.9%) 0.001 ops/ms [Average]
90878787665 cycles:u # 4.072 GHz (39.25%)
254117779312 instructions:u # 2.80 insn per cycle
I also enabled diagnostic options on the JVM:
"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics", "-XX:+PrintAssembly"
With these, I see the following:
0x00007fe451959413: call 0x00007fe451239f00 ; ImmutableOopMap {rsi=Oop }
;*synchronization entry
; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
; {runtime_call counter_overflow Runtime1 stub}
0x00007fe451959418: jmp 0x00007fe4519593ce
0x00007fe45195941a: movabs $0x7fe4519593ee,%r10 ; {internal_word}
0x00007fe451959424: mov %r10,0x358(%r15)
0x00007fe45195942b: jmp 0x00007fe451193100 ; {runtime_call SafepointBlob}
0x00007fe451959430: nop
0x00007fe451959431: nop
0x00007fe451959432: mov 0x3d0(%r15),%rax
0x00007fe451959439: movq $0x0,0x3d0(%r15)
0x00007fe451959444: movq $0x0,0x3d8(%r15)
0x00007fe45195944f: add $0x40,%rsp
0x00007fe451959453: pop %rbp
0x00007fe451959454: jmp 0x00007fe451231e80 ; {runtime_call unwind_exception Runtime1 stub}
0x00007fe451959459: hlt
<More halts cut off>
[Exception Handler]
0x00007fe451959460: call 0x00007fe451234580 ; {no_reloc}
0x00007fe451959465: movabs $0x7fe46e76df9a,%rdi ; {external_word}
0x00007fe45195946f: and $0xfffffffffffffff0,%rsp
0x00007fe451959473: call 0x00007fe46e283d40 ; {runtime_call}
0x00007fe451959478: hlt
[Deopt Handler Code]
0x00007fe451959479: movabs $0x7fe451959479,%r10 ; {section_word}
0x00007fe451959483: push %r10
0x00007fe451959485: jmp 0x00007fe4511923a0 ; {runtime_call DeoptimizationBlob}
0x00007fe45195948a: hlt
<More halts cut off>
--------------------------------------------------------------------------------
============================= C2-compiled nmethod ==============================
** svml call failed for double_pow_32
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 2 java.lang.Math::pow (6 bytes) (intrinsic)
Investigations/Questions:
Note: I believe it is using 256-bit wide vectors (checked during debugging).
This might be related to JDK-8262275, "Math vector stubs are not called for double64 vectors":

"For Double64Vector, the svml math vector stubs intrinsification is failing and they are not being called from jitted code."

But we do have svml double64 vectors.
You might try alternative operations: instead of vE.pow(powerOperand), with vE being a vector of e, you can use powerOperand.lanewise(VectorOperators.EXP) to compute e^x for all lanes. Keep in mind that this API is a work in progress in its incubator state…
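A minimal, self-contained sketch of that substitution (assuming JDK 17+ with the jdk.incubator.vector module enabled; the class and method names are illustrative, not from the poster's code):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ExpLanewiseSketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Original form: broadcast e into a vector, then raise it with pow().
    static DoubleVector viaPow(DoubleVector p) {
        return DoubleVector.broadcast(SPECIES, Math.E).pow(p);
    }

    // Suggested form: one lanewise EXP, no pow() and no e-vector needed.
    static DoubleVector viaExp(DoubleVector p) {
        return p.lanewise(VectorOperators.EXP);
    }

    public static void main(String[] args) {
        // Exponents shaped like the kernel's -r * t values.
        double[] exponents = new double[SPECIES.length()];
        for (int i = 0; i < exponents.length; i++) {
            exponents[i] = -0.02 * (i + 1);
        }
        DoubleVector p = DoubleVector.fromArray(SPECIES, exponents, 0);
        double[] a = viaPow(p).toArray();
        double[] b = viaExp(p).toArray();
        for (int i = 0; i < a.length; i++) {
            System.out.printf("pow: %.10f  exp: %.10f%n", a[i], b[i]);
        }
    }
}
```

The two forms agree to within floating-point tolerance; whether EXP hits the SVML stubs on a given JDK build and CPU still needs to be verified with -XX:+PrintIntrinsics.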