Starting from a large set of points in the x--y plane, I would like to select a subset of these points such that the selection is weighted according to an ellipse with known major and minor axes. For example:
import numpy as np
npts = int(1e5)
lim = 3
x = np.random.uniform(-lim, lim, npts)
y = np.random.uniform(-lim, lim, npts)
major_axis = np.array((1, 1))
minor_axis = np.array((-0.25, 0.25))
The above two vectors define an ellipse with a 4:1 axis ratio whose major axis points along the line y = x. So I am trying to write down a Monte Carlo sampling algorithm whereby a point in the x--y plane lying on the input major axis (in this case the line y = x) has an enhanced probability of being selected over a point lying on the minor axis (in this case the line y = -x), where the probability enhancement factor is simply the major-to-minor axis ratio (a factor of 4 in this case).
I have been trying to do this using the pdf method of scipy.stats.multivariate_normal, but I think I must be using the method improperly. The way I'm going about it is to define the covariance matrix by treating the major and minor axes as the eigendirections, call the pdf method on each point, sort the resulting probabilities, and select the top Nselect of them.
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

cov = np.array((major_axis, minor_axis))
p = np.vstack((x, y)).T
prob_select = multivariate_normal.pdf(p, cov=cov)
idx_select = np.argsort(prob_select)
Nselect = len(x) // 10  # integer division so Nselect can be used in slicing
result_x = x[idx_select][-Nselect:]
result_y = y[idx_select][-Nselect:]
fig, ax = plt.subplots(1, 1)
__=ax.scatter(result_x, result_y, s=1)
xlim = ax.set_xlim(-3, 3)
ylim = ax.set_ylim(-3, 3)
The above plot shows that something is incorrect in my algorithm, since the major axis of this ellipse does not lie on the line y = x. My suspicion was that the covariance matrix is not defined correctly, but when I use the same covariance matrix with the rvs method, I get the expected distribution:
correct_result = multivariate_normal.rvs(size=Nselect, cov=cov)
fig, ax = plt.subplots(1, 1)
__=ax.scatter(correct_result[:, 0], correct_result[:, 1], s=1)
xlim = ax.set_xlim(-3, 3)
ylim = ax.set_ylim(-3, 3)
Is there a simple mistake I am making in my usage of multivariate_normal.pdf or in my covariance matrix definition? If the algorithm is flawed in some other way, is there a simpler way to define such a selection function starting from the major/minor axes of an ellipse?
The covariance matrix is ill-formed here, and you can make no inferences from the resulting behavior. The fact that the rvs method gives different results in this case merely reflects the fact that rvs and pdf preprocess their arguments differently. Whereas rvs basically passes its parameters straight through to numpy.random.multivariate_normal:
# https://github.com/scipy/scipy/blob/v0.14.0/scipy/stats/_multivariate.py#L405
dim, mean, cov = _process_parameters(None, mean, cov)
out = np.random.multivariate_normal(mean, cov, size)
return _squeeze_output(out)
pdf passes the covariance matrix to a function that computes a pseudo-inverse:
# https://github.com/scipy/scipy/blob/v0.14.0/scipy/stats/_multivariate.py#L378
dim, mean, cov = _process_parameters(None, mean, cov)
x = _process_quantiles(x, dim)
prec_U, log_det_cov = _psd_pinv_decomposed_log_pdet(cov)
out = np.exp(self._logpdf(x, mean, prec_U, log_det_cov))
return _squeeze_output(out)
These are only guaranteed to give consistent results if the covariance matrix is well-formed. A covariance matrix contains the covariance between each pair of dimensions, so it is by definition symmetric.
The docs reiterate this:
cov : 2-D array_like, of shape (N, N)
Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
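As a quick check of my own (not part of the question's code), the matrix built from the two axis vectors fails this symmetry requirement:

>>> import numpy
>>> cov = numpy.array(((1, 1), (-0.25, 0.25)))  # the question's np.array((major_axis, minor_axis))
>>> numpy.allclose(cov, cov.T)  # a covariance matrix must equal its transpose
False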
Given that you want to get a covariance matrix from major and minor axes, what you really want is to solve an eigenvector problem in reverse! Yay! I wish we had MathJax...
We need a symmetric matrix C = [[a, b], [b, a]] such that [1, 1] and [1, -1] are eigenvectors, and we also want the ratio of the eigenvalues to be four to one. That means C * [1, 1] = [4, 4] and C * [1, -1] = [1, -1]. Picking 1 as our minor-axis eigenvalue and 4 as our major-axis eigenvalue, and doing the matrix multiplication by hand with variables, we get a + b = 4 and a - b = 1. So a is 2.5, b is 1.5, and C is [[2.5, 1.5], [1.5, 2.5]].
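As a sanity check on that arithmetic (my own verification, not part of the original derivation), numpy recovers exactly the eigenvalues we asked for:

>>> import numpy
>>> C = numpy.array([[2.5, 1.5], [1.5, 2.5]])
>>> evals, evecs = numpy.linalg.eigh(C)
>>> evals  # ascending order: minor-axis eigenvalue first
array([ 1.,  4.])

And the columns of evecs are unit vectors proportional (up to sign) to [1, -1] and [1, 1], as intended.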
We can also use matrix equations to find a more direct solution. If E is the matrix whose columns are the eigenvectors, [[1, 1], [1, -1]], and lambda is the diagonal matrix of eigenvalues, [[4, 0], [0, 1]], then we are looking for the matrix X such that

X @ E = E @ lambda

where @ indicates matrix multiplication (as in Python 3.5+). That means

X = E @ lambda @ E ^ -1
In numpy that's

>>> import numpy
>>> E = numpy.array([[1, 1], [1, -1]])
>>> lambda_ = numpy.array([[4, 0], [0, 1]])
>>> E @ lambda_ @ numpy.linalg.pinv(E)
array([[ 2.5,  1.5],
       [ 1.5,  2.5]])
Using this as cov in your code gives the following:
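Here is that combination spelled out, a runnable sketch of the question's snippet with only cov swapped for the symmetric matrix derived above; the selected points should now form an ellipse whose major axis lies along y = x:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

npts = int(1e5)
lim = 3
x = np.random.uniform(-lim, lim, npts)
y = np.random.uniform(-lim, lim, npts)

# symmetric, positive-definite covariance with eigenvectors [1, 1] (eigenvalue 4)
# and [1, -1] (eigenvalue 1)
cov = np.array([[2.5, 1.5], [1.5, 2.5]])

p = np.vstack((x, y)).T
prob_select = multivariate_normal.pdf(p, cov=cov)
idx_select = np.argsort(prob_select)
Nselect = len(x) // 10  # keep the densest 10% of points, as in the question
result_x = x[idx_select][-Nselect:]
result_y = y[idx_select][-Nselect:]

fig, ax = plt.subplots(1, 1)
ax.scatter(result_x, result_y, s=1)
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
plt.show()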