Tags: python, numpy, scipy, montecarlo, scientific-computing

Use scipy to define a Monte Carlo sampling ellipse in the x-y plane


Starting from a large set of points in the x-y plane, I would like to preferentially select a subset of these points in a way that is defined by an ellipse with known major and minor axes. For example:

import numpy as np 

npts = int(1e5)
lim = 3
x = np.random.uniform(-lim, lim, npts)
y = np.random.uniform(-lim, lim, npts)

major_axis = np.array((1, 1))
minor_axis = np.array((-0.25, 0.25))

The above two vectors define an ellipse with a 4:1 axis ratio whose major axis points along the line y = x. So I am trying to write down a Monte Carlo sampling algorithm in which a point in the x-y plane lying on the input major axis (in this case the line y = x) has an enhanced probability of being selected over a point lying on the minor axis (in this case the line y = -x), where the probability enhancement factor is simply the major-to-minor axis ratio (a factor of 4 in this case).

I have been trying to do this using the pdf method of scipy.stats.multivariate_normal, but I think I must be using the method improperly. The way I'm going about it is to define the covariance matrix by treating the major and minor axes as the eigendirections, evaluate the pdf at each point, sort the resulting probabilities, and select the top Nselect points.

from scipy.stats import multivariate_normal
cov = np.array((major_axis, minor_axis))
p = np.vstack((x, y)).T
prob_select = multivariate_normal.pdf(p, cov=cov)
idx_select = np.argsort(prob_select)
Nselect = len(x) // 10  # integer division so Nselect can be used as a slice index
result_x = x[idx_select][-Nselect:]
result_y = y[idx_select][-Nselect:]

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
__=ax.scatter(result_x, result_y, s=1)
xlim = ax.set_xlim(-3, 3)
ylim = ax.set_ylim(-3, 3)

incorrect distribution

The above plot shows that something is incorrect in my algorithm, since the major axis of this ellipse does not lie on the line y = x. My suspicion was that the covariance matrix was not defined correctly, but when I use the same covariance matrix with the rvs method, I get the expected distribution:

correct_result = multivariate_normal.rvs(size=Nselect, cov=cov)
fig, ax = plt.subplots(1, 1)
__=ax.scatter(correct_result[:, 0], correct_result[:, 1], s=1)
xlim = ax.set_xlim(-3, 3)
ylim = ax.set_ylim(-3, 3)

expected distribution from rvs

Is there a simple mistake I am making in my usage of multivariate_normal.pdf or in my definition of the covariance matrix? If the algorithm is flawed in some other way, is there a simpler way to define such a selection function starting from the major/minor axes of an ellipse?


Solution

  • Why the results are inconsistent

    The covariance matrix is ill-formed here, so you can draw no inferences from the resulting behavior. The fact that the rvs method gives different results in this case is merely a reflection of the fact that rvs and pdf preprocess their arguments differently. rvs basically passes its parameters straight through to numpy.random.multivariate_normal:

    # https://github.com/scipy/scipy/blob/v0.14.0/scipy/stats/_multivariate.py#L405
    dim, mean, cov = _process_parameters(None, mean, cov)
    out = np.random.multivariate_normal(mean, cov, size)
    return _squeeze_output(out)
    

    pdf passes the covariance matrix to a function that calculates the pseudo-inverse:

    # https://github.com/scipy/scipy/blob/v0.14.0/scipy/stats/_multivariate.py#L378
    dim, mean, cov = _process_parameters(None, mean, cov)
    x = _process_quantiles(x, dim)
    prec_U, log_det_cov = _psd_pinv_decomposed_log_pdet(cov)
    out = np.exp(self._logpdf(x, mean, prec_U, log_det_cov))
    return _squeeze_output(out)
    

    These are only guaranteed to give consistent results if the covariance matrix is well formed.
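
    A quick check (a minimal sketch reusing the major_axis and minor_axis arrays from the question) shows that the matrix built by stacking the two axis vectors is not even symmetric, so rvs and pdf are free to disagree on it:

    import numpy as np

    major_axis = np.array((1, 1))
    minor_axis = np.array((-0.25, 0.25))

    # Stacking the axis vectors as rows, exactly as in the question
    cov = np.array((major_axis, minor_axis))

    # A valid covariance matrix must equal its own transpose
    print(np.allclose(cov, cov.T))  # False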

    Building a well-formed matrix

    A covariance matrix is just the covariance of each corresponding pair of dimensions, so it is symmetric by definition (a quick numerical check follows the quote below).

    The docs reiterate this:

    cov : 2-D array_like, of shape (N, N)

    Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
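
    As a sanity check (a small sketch with made-up sample data), the covariance matrix numpy estimates from any dataset has this symmetry automatically:

    import numpy as np

    np.random.seed(0)
    samples = np.random.randn(1000, 2)  # arbitrary 2-D sample data

    # np.cov expects variables in rows, observations in columns
    C = np.cov(samples.T)
    print(np.allclose(C, C.T))  # True: an empirical covariance matrix is symmetric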

    Given that you want to get a covariance matrix from major and minor axes, what you really want is to solve a reverse eigenvector problem! Yay! I wish we had mathjax...

    We need a symmetric matrix C = [[a, b], [b, a]] such that [1, 1] and [1, -1] are eigenvectors, and we also want the ratio of eigenvalues to be four to one. That means C * [1, 1] = [4, 4] and C * [1, -1] = [1, -1]. Picking 1 as our minor-axis eigenvalue and 4 as our major-axis eigenvalue, and doing the matrix multiplication by hand using variables, we get a + b = 4 and a - b = 1. So a is 2.5, b is 1.5, and C is [[2.5, 1.5], [1.5, 2.5]].
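
    A quick numerical check (a small sketch using numpy.linalg.eigh, which handles symmetric matrices) confirms that this hand-derived matrix has the intended eigenvectors and four-to-one eigenvalue ratio:

    import numpy as np

    C = np.array([[2.5, 1.5], [1.5, 2.5]])

    # eigh returns eigenvalues in ascending order for a symmetric matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    print(eigenvalues)         # [1. 4.]
    print(eigenvectors[:, 0])  # proportional to [1, -1] (up to sign): the minor axis
    print(eigenvectors[:, 1])  # proportional to [1, 1] (up to sign): the major axis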

    We can also use matrix equations to find a more direct solution. If E is the matrix whose columns are the eigenvectors, [[1, 1], [1, -1]], and lambda is the diagonal matrix of the corresponding eigenvalues, [[4, 0], [0, 1]], then we are looking for the matrix X such that:

    X @ E = E @ lambda
    

    Where @ indicates matrix multiplication (as in Python 3.5+).

    That means

    X = E @ lambda @ E ^ -1
    

    In numpy that's

    >>> import numpy
    >>> E = numpy.array([[1, 1], [1, -1]])
    >>> lambda_ = numpy.array([[4, 0], [0, 1]])
    >>> E @ lambda_ @ numpy.linalg.pinv(E)
    array([[ 2.5,  1.5],
           [ 1.5,  2.5]])
    

    Using this as cov in your code gives the following:

    Chart of a correctly skewed ellipse.
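
    For completeness, here is one way the whole selection could look with the well-formed covariance matrix substituted into your code (a sketch assembled from the question's snippets, with the matplotlib import added and Nselect forced to an integer):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import multivariate_normal

    npts = int(1e5)
    lim = 3
    x = np.random.uniform(-lim, lim, npts)
    y = np.random.uniform(-lim, lim, npts)

    # Well-formed covariance matrix with eigenvectors along y = x and y = -x
    # and a four-to-one eigenvalue ratio, as derived above
    cov = np.array([[2.5, 1.5], [1.5, 2.5]])

    # Evaluate the zero-mean Gaussian pdf at each point
    p = np.vstack((x, y)).T
    prob_select = multivariate_normal.pdf(p, cov=cov)

    # Keep the ten percent of points with the highest pdf values
    Nselect = len(x) // 10
    idx_select = np.argsort(prob_select)
    result_x = x[idx_select][-Nselect:]
    result_y = y[idx_select][-Nselect:]

    fig, ax = plt.subplots(1, 1)
    ax.scatter(result_x, result_y, s=1)
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    plt.show()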