df.plot.scatter: c and cmap


I have a dataframe (NB: this is dummy data and does not match what is shown in the plots):

    Index     BGC frequency - Count     Proportion of total BGCs both captured and not captured by antiSMASH - %
  species_a            1                                       2
  species_b            3                                       4
     ...              ...                                     ...

I want to make a scatter plot of 'BGC frequency - Count' vs 'Proportion of total BGCs both captured and not captured by antiSMASH - %', with points coloured according to the categorical Index, and with a legend.

import matplotlib.pyplot as plt
from matplotlib import colors
import pandas as pd

colorlist = list(colors.ColorConverter.colors.keys())
captured_df.plot.scatter(x='BGC frequency - Count',
                         y='Proportion of total BGCs both captured and not captured by antiSMASH - %',
                         c=colorlist,
                         title='BGCs with an antiSMASH region')

Gets me close:

Dataframe scatter plot

But I can't get a legend. Ideally I'd want something like what is shown here, line 69:

desired format of dataframe scatter plot

But when I tried:

df.plot.scatter(x='BGC frequency - Count', y='Proportion of total BGCs both captured and not captured by antiSMASH - %', c=df.index, cmap="viridis", s=50)

I get:

ValueError: 'c' argument must be a mpl color, a sequence of mpl colors or a sequence of numbers, not Index(...list of index species names...)

I'm not sure why this is. I thought cmap converts the c data into a list of the correct data type? The link above deals explicitly with categorical data:

If a categorical column is passed to c, then a discrete colorbar will be produced

Also, please note that I don't want a numerical color bar, as this would not be much use:

bad scatter plot

Thanks for reading :D


Solution

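  • The trick is to convert the "type" column to categorical. In your case the categories are in the Index, so they would first need to be moved into a regular column (for instance with reset_index()) before being passed to c.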

    For example:

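    import matplotlib.pyplot as plt
    import pandas as pd

    d = pd.DataFrame([["a", 1, 3], ["b", 3, 3], ["b", 2, 3], ["a", 5, 2]], columns=['type', 'x', 'y'])
    # A categorical dtype is what makes c='type' produce a discrete colorbar
    d['type'] = pd.Categorical(d['type'])
    d.plot.scatter(x='x', y='y', c='type', cmap='inferno')
    plt.show()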
    


    This should work.

    Note that this feature was added in pandas 1.3.0 (released July 2, 2021), so make sure you are using that version or later.
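
The species names in the question live in the index rather than in a column, so here is a minimal sketch of how the same idea might be applied there. The column names `count` and `proportion`, the axis name `species`, and the values are placeholders, not the real data. The second half is an assumption about what the question is after: it draws one scatter call per category to get a true legend instead of a discrete colorbar, which is not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Dummy frame shaped like the one in the question: species names in the index.
# "count" and "proportion" are short stand-ins for the real column names.
df = pd.DataFrame(
    {"count": [1, 3, 2], "proportion": [2.0, 4.0, 5.0]},
    index=["species_a", "species_b", "species_c"],
)

# Move the index into a regular column and make that column categorical.
plot_df = df.rename_axis("species").reset_index()
plot_df["species"] = pd.Categorical(plot_df["species"])

# pandas >= 1.3.0: a categorical column passed to c gives a discrete colorbar.
ax = plot_df.plot.scatter(x="count", y="proportion", c="species",
                          cmap="viridis", s=50)

# Alternative for a real legend: one scatter call per category, each labelled.
fig, ax2 = plt.subplots()
for name, group in plot_df.groupby("species", observed=True):
    ax2.scatter(group["count"], group["proportion"], s=50, label=name)
ax2.set_xlabel("count")
ax2.set_ylabel("proportion")
ax2.legend(title="species")
```

The loop lets matplotlib pick a colour per species from its default colour cycle. With many species, that cycle repeats, so a colormap-based approach may suit better.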