Tags: python, matplotlib, density-plot

Cluster points based on density threshold


Updated my question. See below.

I have a scatter plot with a lot of noise, and I only want to plot the points above a density threshold.

I calculated the density of the points with gaussian_kde, but I don't know how to apply the threshold. I tried masking the points, but this doesn't work.

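import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

thresh = 10
x = x_data
y = y_data
xy = np.vstack([x, y])
z = gaussian_kde(xy)(xy)  # density estimate at each point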

x1 = np.ma.masked_where(z > thresh, x) # mask points above threshold
y1 = np.ma.masked_where(z > thresh, y) # mask points above threshold

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=10)

I expected a plot with less noise, but nothing changes when I plot x1 and y1. I only want to see the points with high density.
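
A likely reason the mask changes nothing: KDE values for data spread over a 0 to 360 range are tiny (the thresholds used further down are around 1e-5), so z > 10 is never true and no point gets masked. The condition is also inverted for the stated goal, and the scatter call plots x and y rather than x1 and y1. Below is a minimal sketch of plain boolean indexing as an alternative. The synthetic data stands in for x_data and y_data, and the median quantile used as the threshold is an arbitrary assumption, not a recommended value.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for x_data / y_data (hypothetical):
# one tight cluster plus uniform background noise.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(180, 5, 300), rng.uniform(0, 360, 300)])
y = np.concatenate([rng.normal(250, 5, 300), rng.uniform(0, 360, 300)])

xy = np.vstack([x, y])
z = gaussian_kde(xy)(xy)  # density estimate at each point

# Assumed choice: derive the threshold from the densities themselves
# instead of hard-coding a number.
thresh = np.quantile(z, 0.5)
keep = z >= thresh  # True where the point is dense enough to keep

# Index all three arrays with the same mask so they stay aligned.
x1, y1, z1 = x[keep], y[keep], z[keep]
print(f"kept {keep.sum()} of {x.size} points; z spans {z.min():.2e} to {z.max():.2e}")
```

Passing x1, y1 and c=z1 to ax.scatter then draws only the retained points.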


To reduce the noise, I tried to cluster the points based on their density. The density was calculated with gaussian_kde.

I made a 3D scatter plot to estimate the thresholds to separate the clusters.

x = x_data
y = y_data
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)

cI_t = 0.0000059
cI_x = np.ma.masked_where(z < cI_t, x).compressed()
cI_y = np.ma.masked_where(z < cI_t, y).compressed()
cII_t = 0.0000165
cII_x = np.ma.masked_where(z < cII_t, x).compressed()
cII_y = np.ma.masked_where(z < cII_t, y).compressed()
# cII_y must exist before it is used; apply the same condition to both arrays
cII_x_1 = cII_x[cII_y >= 252]
cII_y_1 = cII_y[cII_y >= 252]
cIII_t = 0.0000048
cIII_x = np.ma.masked_where(z < cIII_t, x).compressed()
cIII_y = np.ma.masked_where(z < cIII_t, y).compressed()
cIV_t = 0.00003
cIV_x = np.ma.masked_where(z < cIV_t, x).compressed()
cIV_y = np.ma.masked_where(z < cIV_t, y).compressed()

# 3D Density plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)
plt.show()

# Scatter plot cII and cIV
fig2, ax2 = plt.subplots()
#plt.scatter(cI_x, cI_y)
plt.scatter(cII_x, cII_y)
#plt.scatter(cIII_x, cIII_y)
plt.scatter(cIV_x, cIV_y)
plt.axhline(y=255)
ax2.set_xlim(0,360)
ax2.set_ylim(0,360)
plt.show()

But now I need to select only the top blue points from the cII cluster. Is there a way to select only the points above the blue line? (Ignore the orange dots; they are the cIV cluster.)
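
One way to approach this is to build a single boolean mask that combines the density threshold with the height cut and apply it to both coordinate arrays, so that x and y stay paired. The sketch below uses small made-up arrays in place of x, y and z; the cut value y_cut = 252 is taken from the indexing attempt above and is an assumption about where the line should sit.

```python
import numpy as np

# Hypothetical stand-ins for x, y and the KDE values z.
x = np.array([170.0, 200.0, 210.0, 190.0, 100.0])
y = np.array([300.0, 260.0, 120.0, 251.0, 310.0])
z = np.array([2.0e-5, 1.8e-5, 3.0e-5, 1.7e-5, 1.0e-6])

cII_t = 0.0000165  # density threshold for cluster cII
y_cut = 252        # assumed height of the horizontal cut

# One mask, two conditions: dense enough AND on or above the cut.
sel = (z >= cII_t) & (y >= y_cut)

# Applying the same mask to both arrays keeps the pairs intact.
cII_x_top, cII_y_top = x[sel], y[sel]
print(cII_x_top, cII_y_top)
```

Only the first two points pass both conditions here; the third and fourth fall below the cut and the fifth is below the density threshold.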


Solution

    Example for cluster cII: I built a pandas DataFrame from the x and y data and then selected the points based on the coordinate values read off the scatter plot.

    cII_t = 0.0000165
    cII_x = np.ma.masked_where(z < cII_t, x).compressed()
    cII_y = np.ma.masked_where(z < cII_t, y).compressed()
    cII_df = pd.DataFrame({"x": cII_x, "y": cII_y})
    # keep only the points inside the box read off the scatter plot
    cII_df = cII_df[(cII_df["x"] >= 166) & (cII_df["x"] <= 227) & (cII_df["y"] >= 252) & (cII_df["y"] <= 336)]
    cII_x = cII_df["x"]
    cII_y = cII_df["y"]
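
The same box selection can be written a little more compactly with Series.between, which is inclusive on both ends by default and so matches the >= and <= comparisons. A minimal sketch, assuming small made-up arrays in place of the thresholded cII_x and cII_y:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the thresholded cII_x / cII_y arrays.
cII_x = np.array([150.0, 170.0, 200.0, 230.0, 210.0])
cII_y = np.array([300.0, 260.0, 340.0, 280.0, 252.0])

cII_df = pd.DataFrame({"x": cII_x, "y": cII_y})

# between() includes both bounds by default.
in_box = cII_df["x"].between(166, 227) & cII_df["y"].between(252, 336)
cII_box = cII_df[in_box]
print(cII_box)
```

Rows whose x or y falls outside the box are dropped, while a point sitting exactly on a bound (y = 252 here) is kept.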
    

    The final plot: