Use of a categorical variable to define colors and markers in bokeh scatter plot


I have a pandas dataframe with two data columns (for simplicity let us call them 'x' and 'y'), and a categorical column (say 'color' with values 'red', 'green', and 'blue'). Now I want to use bokeh to generate a scatter plot with different marker symbols ('red'->'x', 'green'->'o', and 'blue'->'triangle').

While I did find a solution where I extracted the relevant portions of 'x' and 'y' values manually, I thought it should be possible to do this in one command using "categorical" plotting in bokeh. However, the documentation primarily considers bar plots, and when I try to use the result of df.groupby('color') in ColumnDataSource, plotting 'x' and 'y' in scatter (with source=source) fails, because the column names 'x' and 'y' are not found.

Here is sample code to illustrate the problem:

import pandas as pd
import bokeh.plotting as plt

df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])
source = plt.ColumnDataSource(df.groupby('color'))
# source = plt.ColumnDataSource(df) -- this would work for colors
fig = plt.figure()
fig.scatter('x', 'y', color='color', source=source)
plt.show(fig)

This snippet presents the minimum needed. Without the groupby, color='color' actually works, but in my real example, the categorical variable has non-color values. Furthermore, how would I specify multiple symbols as requested?


Solution

  • UPDATE: The original answer below is still valid, but this kind of thing can now also be accomplished more easily with color and marker mapping transforms:

    from bokeh.plotting import figure, show
    from bokeh.sampledata.iris import flowers
    from bokeh.transform import factor_cmap, factor_mark
    
    SPECIES = ['setosa', 'versicolor', 'virginica']
    MARKERS = ['hex', 'circle_x', 'triangle']
    
    p = figure(title="Iris Morphology")
    p.xaxis.axis_label = 'Petal Length'
    p.yaxis.axis_label = 'Sepal Width'
    
    p.scatter("petal_length", "sepal_width", source=flowers, legend_field="species", fill_alpha=0.4, size=12,
              marker=factor_mark('species', MARKERS, SPECIES),
              color=factor_cmap('species', 'Category10_3', SPECIES))
    
    show(p)
    

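Applied to the small dataframe from the question, the same transforms might look like the sketch below. The names `FACTORS`, `MARKERS`, `PALETTE`, and `build_plot` are illustrative and not part of Bokeh. The palette is listed separately from the category values to show that the categories do not have to be color names, and the 'o' symbol from the question is assumed to correspond to Bokeh's `'circle'` marker.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import factor_cmap, factor_mark

df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])

# Illustrative lookup lists: position i in MARKERS and PALETTE styles FACTORS[i].
FACTORS = ['red', 'green', 'blue']      # values of the categorical column
MARKERS = ['x', 'circle', 'triangle']   # marker per category
PALETTE = ['red', 'green', 'blue']      # color per category (any colors would do)


def build_plot(frame):
    """Return a figure with one scatter renderer styled by the 'color' column."""
    source = ColumnDataSource(frame)
    p = figure()
    p.scatter('x', 'y', source=source, size=15, legend_field='color',
              marker=factor_mark('color', MARKERS, FACTORS),
              color=factor_cmap('color', PALETTE, FACTORS))
    return p
```

Passing the result to `show`, as in `show(build_plot(df))`, renders the plot. Because all points stay in a single source and a single renderer, no grouping or filtering is needed.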


    Original Answer

    Passing a GroupBy to the CDS is not going to help here, because that creates a CDS of the summarized data (one row of statistics per group), whereas you want all the individual points. Here is one way to accomplish what you are asking, using CDSView and GroupFilter as described in "Providing Data for Plots and Tables":

    import pandas as pd
    
    from bokeh.io import show
    from bokeh.models import ColumnDataSource, CDSView, GroupFilter
    from bokeh.plotting import figure
    
    
    df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                            [1., 2., 'blue'], [2., 1., 'blue']],
                      columns=['x', 'y', 'color'])
    
    source = ColumnDataSource(df)
    
    # create views for the different groups
    red = CDSView(source=source, filters=[GroupFilter(column_name='color', group='red')])
    green = CDSView(source=source, filters=[GroupFilter(column_name='color', group='green')])
    blue = CDSView(source=source, filters=[GroupFilter(column_name='color', group='blue')])
    
    p = figure()
    
    # use the views with different glyphs
    p.circle('x', 'y', size=15, color='red', source=source, view=red)
    p.square('x', 'y', size=15, color='green', source=source, view=green)
    p.triangle('x', 'y', size=15, color='blue', source=source, view=blue)
    
    show(p)
    

    [Figure: plot of grouped data]
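To see why a GroupBy-backed source has no 'x' or 'y' column, it helps to look at the summary table directly. The sketch below uses only pandas. It assumes, as far as I understand Bokeh's behavior, that ColumnDataSource builds its columns from `describe()` on the GroupBy and joins the two-level column names with an underscore.

```python
import pandas as pd

df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])

# describe() yields one row of summary statistics per group,
# not one row per original point.
stats = df.groupby('color').describe()

# The columns form a two-level index such as ('x', 'mean'); joining the
# levels gives flat names such as 'x_mean' (assumed to mirror Bokeh).
flat_columns = ['_'.join(col) for col in stats.columns]

print(len(stats))              # number of groups, not number of points
print('x' in flat_columns)     # the raw column name is gone
print('x_mean' in flat_columns)
```

Under that assumption, a glyph would have to reference a statistic such as 'x_mean' rather than 'x', which is why the scatter call in the question cannot find its columns.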

    Looking at the CDSView code above, it seems there are some fairly simple improvements that could reduce the amount of code (e.g. a source.group method that does all the work of those CDSView lines, or arguments to the glyph methods that specify groups). I'd encourage you to submit a GitHub feature request issue to discuss it further.
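Until something like that exists, a plain loop over the groups can cut the repetition. This is only a minimal sketch: `scatter_by_group` and `STYLES` are hypothetical names, not Bokeh API. It also gives each group its own ColumnDataSource instead of sharing one source through views, so features that rely on a shared source, such as linked selections across groups, are lost.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])

# Hypothetical style table: category value -> (marker, color).
STYLES = {
    'red': ('x', 'red'),
    'green': ('circle', 'green'),
    'blue': ('triangle', 'blue'),
}


def scatter_by_group(p, frame, column, styles, **kwargs):
    """Hypothetical helper: add one scatter renderer per group in `column`."""
    renderers = {}
    for name, group in frame.groupby(column):
        marker, color = styles[name]
        # Each group gets its own data source holding only its rows.
        renderers[name] = p.scatter('x', 'y', source=ColumnDataSource(group),
                                    marker=marker, color=color,
                                    legend_label=str(name), **kwargs)
    return renderers


p = figure()
renderers = scatter_by_group(p, df, 'color', STYLES, size=15)
```

The figure `p` can then be passed to `show` as in the examples above.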