Tags: python-3.x, matplotlib, scikit-learn, histogram, cluster-analysis

Clustering 1D data and representing the clusters on a matplotlib histogram


I have 1D data in the format of:

    import matplotlib.pyplot as plt

    areas = ...  # the 1D data
    plt.figure(figsize=(10, 10))
    plt.hist(areas, bins=80)
    plt.show()

Plotting this gives an ordinary histogram of the values in areas.

Now I want to cluster this data. I know I can use either Kernel Density Estimation or K-Means. But once I have the cluster assignments, how do I represent these clusters on the histogram?


Solution

  • Compute a cluster assignment for every point, then plot each cluster's subset of the data as its own histogram, passing the same bin edges each time so that the bars of the different clusters line up.

    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans

    # hide the top and right axes spines for a cleaner plot
    mpl.rcParams['axes.spines.top'] = False
    mpl.rcParams['axes.spines.right'] = False
    
    # simulate some fake data
    n = 10000
    mu1, sigma1 = 0, 1
    mu2, sigma2 = 6, 2
    a = mu1 + sigma1 * np.random.randn(n)
    b = mu2 + sigma2 * np.random.randn(n)
    data = np.concatenate([a, b])
    
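    # determine which K-Means cluster each point belongs to;
    # KMeans expects a 2D array, hence the reshape to a single column
    cluster_id = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data.reshape(-1, 1))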
    
    # plot one histogram (counts, not densities) per cluster on shared bin edges
    fig, ax = plt.subplots()
    bins = np.linspace(data.min(), data.max(), 40)  # 40 edges, i.e. 39 bins
    for ii in np.unique(cluster_id):
        subset = data[cluster_id==ii]
        ax.hist(subset, bins=bins, alpha=0.5, label=f"Cluster {ii}")
    ax.legend()
    plt.show()
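
The question also mentions Kernel Density Estimation. One possible way to turn a KDE into cluster labels, sketched here under assumptions rather than taken from the answer above, is to evaluate the estimated density on a grid and cut the data at the density's local minima. The helper name `kde_cluster_1d`, the grid size, and the use of SciPy's `gaussian_kde` with its default bandwidth are all illustrative choices.

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.stats import gaussian_kde


def kde_cluster_1d(data, grid_size=512):
    """Hypothetical helper: label 1D points by splitting at KDE minima."""
    data = np.asarray(data, dtype=float)
    # evaluate a Gaussian KDE (default bandwidth) on an evenly spaced grid
    grid = np.linspace(data.min(), data.max(), grid_size)
    density = gaussian_kde(data)(grid)
    # interior grid positions where the density has a strict local minimum
    boundaries = grid[argrelmin(density)[0]]
    # label 0 is everything below the first boundary, label 1 the next interval, ...
    labels = np.digitize(data, boundaries)
    return labels, boundaries


# same kind of simulated data as above, with a fixed seed for repeatability
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 2000), rng.normal(6, 2, 2000)])
labels, boundaries = kde_cluster_1d(data)
```

The resulting `labels` array can be used in place of `cluster_id` in the plotting loop above, and calling `ax.axvline(b)` for each `b` in `boundaries` would mark the split points on the histogram. Unlike K-Means, the number of clusters is not fixed in advance: it depends on the bandwidth, and sparse tails can produce extra minima, so the `bw_method` argument of `gaussian_kde` may need tuning for real data.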