In this example, I'm calling the function multivariate_normal.cdf twice with identical inputs and getting two different results. The results are close but still different. Are there any adjustments I can make to get exactly the same result every time I call the function?
import numpy as np
from scipy.stats import multivariate_normal
def normal_cdf_3d(d_1, d_2, d_3, correl_1_2, correl_1_3, correl_2_3):
    mean = np.array([0, 0, 0])
    cov = np.array(
        [
            [1, correl_1_2, correl_1_3],
            [correl_1_2, 1, correl_2_3],
            [correl_1_3, correl_2_3, 1],
        ]
    )
    x = np.array([d_1, d_2, d_3])
    y = multivariate_normal(mean=mean, cov=cov).cdf(x=x)
    return y
d_1_ = -0.2304886114323222
d_2_ = -0.1479019945774904
d_3_ = 0.525
correl_1_2_ = 0.5500190982169267
correl_1_3_ = 0.9219544457292886
correl_2_3_ = 0.5916079783099616
a = normal_cdf_3d(
    d_1=d_1_,
    d_2=d_2_,
    d_3=d_3_,
    correl_1_2=correl_1_2_,
    correl_1_3=correl_1_3_,
    correl_2_3=correl_2_3_,
)
b = normal_cdf_3d(
    d_1=d_1_,
    d_2=d_2_,
    d_3=d_3_,
    correl_1_2=correl_1_2_,
    correl_1_3=correl_1_3_,
    correl_2_3=correl_2_3_,
)
print(a)
print(b)
if a == b:
    print(True)
else:
    print(False)
# Results
# 0.2698170436763295 (a)
# 0.2698184075101584 (b)
# False
Because the CDF of a multivariate normal has no closed form, multivariate_normal has to compute it by numerical integration over a multidimensional space, using a randomized (quasi-)Monte Carlo routine, so repeated calls can land on slightly different values within the convergence tolerances. Those tolerances are fairly loose by default (absolute and relative tolerance of 1e-5 each), and the maximum number of points to consider for integration defaults to 1000000 * ndim, which in this case is 3000000.
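As a side note, the reason repeated calls differ at all is that Monte-Carlo-style integration draws random points. Here is a toy sketch of that effect, assuming nothing about scipy's actual algorithm (which is a more sophisticated randomized quasi-Monte Carlo method): estimating a 1-D standard normal CDF by plain Monte Carlo gives a slightly different answer on every run.

```python
import numpy as np

def mc_normal_cdf(d, n_samples=1_000_000):
    # Plain Monte Carlo estimate of P(Z < d) for Z ~ N(0, 1):
    # draw samples and count the fraction that fall below d.
    samples = np.random.standard_normal(n_samples)
    return float(np.mean(samples < d))

a = mc_normal_cdf(0.0)
b = mc_normal_cdf(0.0)
print(a, b)  # two estimates near 0.5, but almost never bit-identical
```

Each run consumes fresh random draws, so exact run-to-run reproducibility would require pinning the underlying random state.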
import numpy as np
from scipy.stats import multivariate_normal
d_1 = -0.2304886114323222
d_2 = -0.1479019945774904
d_3 = 0.525
correl_1_2 = 0.5500190982169267
correl_1_3 = 0.9219544457292886
correl_2_3 = 0.5916079783099616
mean = np.array([0, 0, 0])
cov = np.array(
    [
        [1, correl_1_2, correl_1_3],
        [correl_1_2, 1, correl_2_3],
        [correl_1_3, correl_2_3, 1],
    ]
)
x = np.array([d_1, d_2, d_3])
mvn = multivariate_normal(mean=mean, cov=cov)
print(mvn.abseps, mvn.releps, mvn.maxpts)
# prints:
# 1e-05 1e-05 3000000
Increasing the number of points and decreasing the tolerances will get you better precision at the cost of performance. Keep in mind that Python floats are IEEE 754 doubles, which carry only about 15-17 significant decimal digits (machine epsilon is about 2.2e-16), so there is a hard floor on how much agreement you can demand.
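You can check that floor directly with the standard library; this is general float behavior, not specific to scipy:

```python
import sys

# Machine epsilon for a Python float (IEEE 754 double): the gap
# between 1.0 and the next representable value.
print(sys.float_info.epsilon)  # 2.220446049250313e-16

# Anything much smaller than epsilon added to 1.0 is lost to rounding,
# so differences below this scale cannot be resolved at all.
print(1.0 + sys.float_info.epsilon / 2 == 1.0)  # True
```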
mvn.cdf(x) - mvn.cdf(x)
# returns:
-9.741263303830738e-06 # Wall time: 1.27 ms
mvn.abseps = mvn.releps = 1e-8
mvn.maxpts = 5_000_000
mvn.cdf(x) - mvn.cdf(x)
# returns:
-3.6016106763625544e-10 # Wall time: 409 ms
mvn.abseps = mvn.releps = 1e-10
mvn.maxpts = 8_000_000
mvn.cdf(x) - mvn.cdf(x)
# returns:
-2.0803747613484802e-11 # Wall time: 2.22 s
mvn.abseps = mvn.releps = 1e-12
mvn.maxpts = 10_000_000
mvn.cdf(x) - mvn.cdf(x)
# returns:
-4.4660386500083861e-12 # Wall time: 3.39 s
The runtime does not scale well. A few more attempts, ramping up the number of points, to see whether we can get identical output:
mvn.abseps = mvn.releps = 1e-17
mvn.maxpts = 35_000_000
mvn.cdf(x) - mvn.cdf(x)
# returns:
-1.755373624234835e-12 # Wall time: 11.1 s
mvn.abseps = mvn.releps = 1e-19
mvn.maxpts = 70_000_000
mvn.cdf(x) - mvn.cdf(x)
# returns:
2.7454705175955496e-12 # Wall time: 30 s
mvn.abseps = mvn.releps = 1e-25
mvn.maxpts = 100_000_000
mvn.cdf(x) - mvn.cdf(x)
# returns:
-2.1491142199181468e-12 # Wall time: 39.3 s
Based on the last three runs, there is a limit to the precision achievable by this calculation. Rather than demanding exact equality, you can treat two values that agree to within some small epsilon as the same number:
def approx_equal(x, y, tol=1e-5):
    return abs(x - y) < tol
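For example, applied to the two results from the question (the helper is restated so the snippet is self-contained; note that the standard library's math.isclose does the same job):

```python
import math

def approx_equal(x, y, tol=1e-5):
    return abs(x - y) < tol

a = 0.2698170436763295  # first cdf result from the question
b = 0.2698184075101584  # second cdf result from the question

print(approx_equal(a, b))                # True
print(math.isclose(a, b, abs_tol=1e-5))  # True, stdlib equivalent
```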