python pandas matplotlib scatter-plot

Use first N colors from qualitative cmap to plot cluster scatter


I want to plot my clusters against their first two principal components, but using only the first N colors from matplotlib's 'Set1' cmap (N depends on the number of clusters).

I understand I can access the color list and slice it to get the number of colors I want. However, when attempting this I get the error:

ValueError: array([[0.89411765, 0.10196078, 0.10980392, 1. ], [1. , 0.49803922, 0. , 1. ], [0.6 , 0.6 , 0.6 , 1. ]]) is not a valid value for name; supported values are 'Accent'...

...suggesting to me I have got the RGBA values of the colors but not the names themselves?

This is the code I am attempting it with (where k is the number of clusters):

cmap = cm.get_cmap('Set1')
cmap = cmap(np.linspace(0, 1, k)) 

points = ax.scatter(data['PC1'], data['PC2'],c=data['cluster'], cmap=cmap ,alpha=0.7)
ax.legend(*points.legend_elements(), title='test')

Solution

  • One possible solution is to loop over the unique cluster values:

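    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Random example data standing in for the PCA output
    x = np.random.uniform(size=10)
    y = np.random.uniform(size=10)
    color_val = np.random.randint(1, 5, 10)
    df = pd.DataFrame({"PC1": x, "PC2": y, "cluster": color_val})

    # Take the first N colors of 'Set1', one per cluster present in the data.
    # plt.get_cmap is used because cm.get_cmap is deprecated since matplotlib 3.7.
    unique_color_val = df["cluster"].unique()
    colors = plt.get_cmap('Set1').colors[:len(unique_color_val)]

    # One scatter call per cluster, so each gets its own color and legend entry
    plt.figure()
    for i, ucv in enumerate(unique_color_val):
        sub_df = df[df["cluster"] == ucv]
        plt.scatter(sub_df["PC1"], sub_df["PC2"], color=colors[i], label="color val = %s" % ucv)
    plt.legend()
    plt.show()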
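
  • Another option, sketched here under a few assumptions, keeps the single scatter call from the question. The error arises because cmap(np.linspace(0, 1, k)) returns an array of RGBA values, while the cmap argument of scatter expects a Colormap object or a registered name. Note also that np.linspace samples evenly across the whole map (first, middle, last color) rather than taking the first k. Wrapping the sliced color list in a ListedColormap should give scatter what it expects. The helper name first_n_cmap, the random data, and the assumption that cluster labels are the integers 0..k-1 are illustrative and not from the original post:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


def first_n_cmap(name, n):
    """Hypothetical helper: colormap holding the first n colors of a qualitative map."""
    base = plt.get_cmap(name)  # 'Set1' is a ListedColormap with a .colors list
    return ListedColormap(base.colors[:n], name=f"{name}_first{n}")


# Made-up data standing in for the PCA output in the question.
# Assumption: cluster labels are the integers 0..k-1.
rng = np.random.default_rng(0)
k = 3
data = pd.DataFrame({
    "PC1": rng.uniform(size=30),
    "PC2": rng.uniform(size=30),
    "cluster": np.arange(30) % k,
})

cmap = first_n_cmap("Set1", k)

fig, ax = plt.subplots()
# vmin/vmax centre each integer label on its own color bin,
# so label i maps to color i even if some cluster is absent.
points = ax.scatter(data["PC1"], data["PC2"], c=data["cluster"],
                    cmap=cmap, vmin=-0.5, vmax=k - 0.5, alpha=0.7)
ax.legend(*points.legend_elements(), title="cluster")
```

    Calling plt.show() afterwards displays the figure. Compared with looping over clusters, this keeps one PathCollection, so legend_elements() can still build the legend as in the question.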