Tags: python, pandas, data-science, sampling, confidence-interval

Sample data from defined interval in Pandas


Assuming my data is normally distributed, I defined a pseudo confidence region over part of my data, considering only the coloured points. I called its upper bound ub and its lower bound lb.

I want to sample my data from within that ellipse. This is what I tried:

sampled_ids = pd_pca.loc[
    pd_pca.pc1.between(pd_pca_stats.loc['pc1', 'lb'], pd_pca_stats.loc['pc1', 'ub']) & \
    pd_pca.pc2.between(pd_pca_stats.loc['pc2', 'lb'], pd_pca_stats.loc['pc2', 'ub'])] \
.sample(10)

However, this approach is not quite correct: it samples from the rectangle bounded by lb and ub on each component, not from the ellipse.
Is there a good way to sample my data from the ellipse?


Solution

  • You need a mask for the ellipse. Let's assume it is centered on (x, y) with semi-axes (a, b), and that its main axes follow the Cartesian axes (otherwise you need to compose with a rotation).

    Then your mask will be

    ellipse_mask = (pd_pca.pc1 - x)**2/a**2 + (pd_pca.pc2 - y)**2/b**2 <= 1
    sampled_ids = pd_pca[ellipse_mask].sample(10)
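
For reference, here is a minimal runnable sketch of the mask approach. It rests on assumptions that are not in the question: the data is synthetic, pd_pca_stats is assumed to have one row per component with columns lb and ub, and the ellipse is taken to be the one inscribed in the [lb, ub] box, so its center is the midpoint of the bounds and its semi-axes are half their width. The helper rotated_ellipse_mask is a hypothetical name illustrating the "compose with a rotation" case, with theta being the counter-clockwise angle of the a-axis from the pc1 axis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the PCA scores (hypothetical data).
pd_pca = pd.DataFrame({
    "pc1": rng.normal(0.0, 2.0, size=1000),
    "pc2": rng.normal(0.0, 1.0, size=1000),
})

# Assumed layout: one row per component, columns 'lb' and 'ub'.
pd_pca_stats = pd.DataFrame(
    {"lb": [-3.0, -1.5], "ub": [3.0, 1.5]}, index=["pc1", "pc2"]
)

# Assumption: the ellipse is inscribed in the [lb, ub] box, so the
# center is the midpoint and each semi-axis is half the box width.
x = (pd_pca_stats.loc["pc1", "lb"] + pd_pca_stats.loc["pc1", "ub"]) / 2
y = (pd_pca_stats.loc["pc2", "lb"] + pd_pca_stats.loc["pc2", "ub"]) / 2
a = (pd_pca_stats.loc["pc1", "ub"] - pd_pca_stats.loc["pc1", "lb"]) / 2
b = (pd_pca_stats.loc["pc2", "ub"] - pd_pca_stats.loc["pc2", "lb"]) / 2

# Axis-aligned ellipse: keep rows whose normalized squared distance is <= 1.
ellipse_mask = (pd_pca.pc1 - x) ** 2 / a ** 2 + (pd_pca.pc2 - y) ** 2 / b ** 2 <= 1
sampled_ids = pd_pca[ellipse_mask].sample(10, random_state=0)


def rotated_ellipse_mask(df, x, y, a, b, theta):
    """Mask for an ellipse whose a-axis is rotated by theta radians
    (counter-clockwise) from the pc1 axis. Hypothetical helper."""
    dx = df.pc1 - x
    dy = df.pc2 - y
    # Project each point onto the ellipse's own axes.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return u ** 2 / a ** 2 + v ** 2 / b ** 2 <= 1
```

Note that .sample(10) raises an error if fewer than 10 rows fall inside the ellipse, so it may be worth checking ellipse_mask.sum() first.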