Use of a categorical variable to define colors and markers in bokeh scatter plot

I have a pandas dataframe with two data columns (for simplicity let us call them 'x' and 'y'), and a categorical column (say 'color' with values 'red', 'green', and 'blue'). Now I want to use bokeh to generate a scatter plot with different marker symbols ('red'->'x', 'green'->'o', and 'blue'->'triangle').

While I did find a solution where I extracted the relevant portions of 'x' and 'y' values manually, I thought it should be possible to do this in one command using "categorical" plotting in bokeh. However, the documentation primarily considers bar plots, and when I try to use the result of df.groupby('color') in ColumnDataSource, plotting 'x' and 'y' in scatter (with source=source) fails, because the column names 'x' and 'y' are not found.

Here is sample code that illustrates the problem:

import pandas as pd
import bokeh.plotting as plt

df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])
source = plt.ColumnDataSource(df.groupby('color'))
# source = plt.ColumnDataSource(df) -- this would work for colors
fig = plt.figure()
fig.scatter('x', 'y', color='color', source=source)
plt.show(fig)

This snippet presents the minimum needed. Without the groupby, color='color' actually works, but in my real example, the categorical variable has non-color values. Furthermore, how would I specify multiple symbols as requested?

Solution

UPDATE: The original answer below is still valid, but this kind of thing can now also be accomplished more easily with color and marker mapping transforms:

from bokeh.plotting import figure, show
from bokeh.sampledata.iris import flowers
from bokeh.transform import factor_cmap, factor_mark

SPECIES = ['setosa', 'versicolor', 'virginica']
MARKERS = ['hex', 'circle_x', 'triangle']

p = figure(title="Iris Morphology")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Sepal Width'

p.scatter("petal_length", "sepal_width", source=flowers, legend_field="species", fill_alpha=0.4, size=12,
          marker=factor_mark('species', MARKERS, SPECIES),
          color=factor_cmap('species', 'Category10_3', SPECIES))

show(p)
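The same transforms should carry over to the small dataframe from the question. The following is only a sketch: the 'group' column and its labels 'a', 'b', 'c' are hypothetical stand-ins for a categorical variable whose values are not color names, and the marker and color choices are assumptions based on the mapping requested in the question.

```python
import pandas as pd
from bokeh.plotting import figure
from bokeh.transform import factor_cmap, factor_mark

# Hypothetical data: same points as in the question, but the categorical
# column holds arbitrary labels instead of color names.
df = pd.DataFrame(data=[[0., 0., 'a'], [1., 0., 'a'], [1., 1., 'b'],
                        [1., 2., 'c'], [2., 1., 'c']],
                  columns=['x', 'y', 'group'])

GROUPS = ['a', 'b', 'c']                # the categories, in a fixed order
MARKERS = ['x', 'circle', 'triangle']   # one marker per category
COLORS = ['red', 'green', 'blue']       # one color per category


def make_plot(frame):
    """Build one scatter glyph whose marker and color depend on 'group'."""
    p = figure(title="Markers and colors by category")
    p.scatter('x', 'y', source=frame, size=12, legend_field='group',
              marker=factor_mark('group', MARKERS, GROUPS),
              color=factor_cmap('group', COLORS, GROUPS))
    return p


p = make_plot(df)
```

Passing p to show() from bokeh.plotting renders the figure. Because the palette is given explicitly, the category labels themselves never need to be valid colors.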

Original Answer

Passing a GroupBy to the CDS is not going to help you, because that creates a CDS of the summarized data, whereas you want all the individual points. Here is one way to accomplish what you are asking, using CDSView and GroupFilter as described in "Providing Data for Plots and Tables":

import pandas as pd

from bokeh.io import show
from bokeh.models import ColumnDataSource, CDSView, GroupFilter
from bokeh.plotting import figure


df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])

source = ColumnDataSource(df)

# create views for the different groups
red = CDSView(source=source, filters=[GroupFilter(column_name='color', group='red')])
green = CDSView(source=source, filters=[GroupFilter(column_name='color', group='green')])
blue = CDSView(source=source, filters=[GroupFilter(column_name='color', group='blue')])

p = figure()

# use the views with different glyphs
p.circle('x', 'y', size=15, color='red', source=source, view=red)
p.square('x', 'y', size=15, color='green', source=source, view=green)
p.triangle('x', 'y', size=15, color='blue', source=source, view=blue)

show(p)
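As a side check on why the GroupBy source from the question fails, the column names of such a source can be inspected directly. This is a small sketch using the question's dataframe; it assumes the documented behavior that a CDS built from a GroupBy holds flattened summary columns such as 'x_mean' rather than the raw 'x' and 'y'.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame(data=[[0., 0., 'red'], [1., 0., 'red'], [1., 1., 'green'],
                        [1., 2., 'blue'], [2., 1., 'blue']],
                  columns=['x', 'y', 'color'])

# A CDS built from a GroupBy holds one row per group, not one row per point.
grouped_source = ColumnDataSource(df.groupby('color'))

# The raw columns are replaced by per-group summary statistics.
print(sorted(grouped_source.column_names))
print('x' in grouped_source.column_names)
```

This is why scatter cannot find 'x' and 'y' in the question's snippet: the source only knows about the summary columns.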

Looking at the CDSView code, it seems there are some simple improvements that could reduce the amount of code (e.g. a source.group method that does all the work of those CDSView lines, or arguments to the glyph methods for specifying groups). I'd encourage you to submit a GitHub feature request issue to discuss it further.