Assuming my data are normally distributed, I defined a pseudo confidence region over part of my data, considering only the coloured points. I called the upper bound ub and the lower bound lb.
I want to sample my data within that ellipse. I did it this way:
sampled_ids = pd_pca.loc[
pd_pca.pc1.between(pd_pca_stats.loc['pc1', 'lb'], pd_pca_stats.loc['pc1', 'ub']) & \
pd_pca.pc2.between(pd_pca_stats.loc['pc2', 'lb'], pd_pca_stats.loc['pc2', 'ub'])] \
.sample(10)
The approach above, however, is not quite correct, because it samples from the bounding rectangle and not from the ellipse.
Do you have a good approach to sample my data from the ellipse?
You need a mask for the ellipse. Let's assume it's centered on (x, y) with semi-axes (a, b), and that the main axes of your ellipse follow the Cartesian axes (otherwise you need to compose with a rotation).
Then your mask will be
ellipse_mask = (pd_pca.pc1 - x)**2/a**2 + (pd_pca.pc2 - y)**2/b**2 <= 1
sampled_ids = pd_pca[ellipse_mask].sample(10)
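If it helps, here is a minimal self-contained sketch of the same idea. It makes two assumptions that are not stated in the question: that the ellipse is the one inscribed in the lb/ub rectangle (so the center is the midpoint of the bounds and the semi-axes are half the widths), and that a rotated ellipse can be described by an angle theta in radians. The helper name ellipse_mask_rotated and the random data are made up for illustration; substitute your own pd_pca and pd_pca_stats.

```python
import numpy as np
import pandas as pd


def ellipse_mask_rotated(px, py, cx, cy, a, b, theta=0.0):
    """Boolean mask of the points inside an ellipse centered on (cx, cy)
    with semi-axes (a, b), rotated counter-clockwise by theta radians."""
    dx, dy = px - cx, py - cy
    # Rotate the offsets by -theta so the ellipse axes line up with the
    # coordinate axes, then apply the usual axis-aligned test.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return (u / a) ** 2 + (v / b) ** 2 <= 1


# Made-up stand-ins for the question's pd_pca and pd_pca_stats.
rng = np.random.default_rng(0)
pd_pca = pd.DataFrame(rng.normal(size=(500, 2)), columns=["pc1", "pc2"])
pd_pca_stats = pd.DataFrame(
    {"lb": [-1.5, -1.0], "ub": [1.5, 1.0]}, index=["pc1", "pc2"]
)

# Assumption: the ellipse is inscribed in the lb/ub rectangle.
x, y = (pd_pca_stats["lb"] + pd_pca_stats["ub"]) / 2
a, b = (pd_pca_stats["ub"] - pd_pca_stats["lb"]) / 2

inside = ellipse_mask_rotated(pd_pca["pc1"], pd_pca["pc2"], x, y, a, b)
sampled_ids = pd_pca[inside].sample(10, random_state=0)
```

With theta left at 0 this reduces to the axis-aligned mask above. Note that sample(10) raises an error if fewer than 10 points fall inside the ellipse, so you may want to check inside.sum() first.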