python performance pandas matplotlib bokeh

Why is Bokeh so much slower than matplotlib?


I drew a box plot with Bokeh and the same plot with matplotlib. Bokeh was about 75 times slower on the same data. Why does Bokeh take so long? Here is the code, which I ran in a Jupyter notebook:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl

from bokeh.charts import BoxPlot, output_notebook, show

from time import time

%matplotlib inline


# Generate data
N = 100000
x1 = 2 + np.random.randn(N)
y1 = ['a'] * N

x2 = -2 + np.random.randn(N)
y2 = ['b'] * N

X = list(x1) + list(x2)
Y = y1 + y2

data = pd.DataFrame()
data['Vals'] = X
data['Class'] = Y

# Note: this shuffles each column independently, and the result (df) is not used below
df = data.apply(np.random.permutation)


# Time the bokeh plot
start_time = time()

p = BoxPlot(data, values='Vals', label='Class',\
            title="MPG Summary (grouped by CYL, ORIGIN)")
output_notebook()
show(p)

end_time = time()
print("Total time taken for Bokeh is {0}".format(end_time - start_time))


# time the matplotlib plot
start_time = time()

data.boxplot(column='Vals', by='Class', sym = 'o')

end_time = time()
print("Total time taken for matplotlib is {0}".format(end_time - start_time))

The print statements produce the following outputs:

Total time taken for Bokeh is 11.8056321144104

Total time taken for matplotlib is 0.1586170196533203


Solution

  • There is some problem specifically with bokeh.charts.BoxPlot. Unfortunately, bokeh.charts does not have a maintainer at the moment, so I can't say when it might get fixed or improved.

    However, in case it is useful to you, I will demonstrate below that you can use the well-established and stable bokeh.plotting API to do things "by hand", and then the time is comparable to, if not faster than, MPL:

    from time import time
    
    import pandas as pd
    import numpy as np
    
    from bokeh.io import output_notebook, show
    from bokeh.plotting import figure
    
    output_notebook()
    
    # Generate data
    N = 100000
    x1 = 2 + np.random.randn(N)
    y1 = ['a'] * N
    
    x2 = -2 + np.random.randn(N)
    y2 = ['b'] * N
    
    X = list(x1) + list(x2)
    Y = y1 + y2
    
    df = pd.DataFrame()
    df['Vals'] = X
    df['Class'] = Y
    
    # Time the bokeh plot
    start_time = time()
    
    # find the quartiles and IQR for each category
    groups = df.groupby('Class')
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    
    cats = ['a', 'b']
    
    p = figure(x_range=cats)
    
    # if no outliers, shrink lengths of stems to be no longer than the minimums or maximums
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    # overwrite the 'Vals' column, which the stem and whisker glyphs below read
    upper['Vals'] = [min([x,y]) for (x,y) in zip(list(qmax.loc[:,'Vals']),upper.Vals)]
    lower['Vals'] = [max([x,y]) for (x,y) in zip(list(qmin.loc[:,'Vals']),lower.Vals)]
    
    # stems
    p.segment(cats, upper.Vals, cats, q3.Vals, line_color="black")
    p.segment(cats, lower.Vals, cats, q1.Vals, line_color="black")
    
    # boxes
    p.vbar(cats, 0.7, q2.Vals, q3.Vals, fill_color="#E08E79", line_color="black")
    p.vbar(cats, 0.7, q1.Vals, q2.Vals, fill_color="#3B8686", line_color="black")
    
    # whiskers (almost-0 height rects simpler than segments)
    p.rect(cats, lower.Vals, 0.2, 0.01, line_color="black")
    p.rect(cats, upper.Vals, 0.2, 0.01, line_color="black")
    
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = "white"
    p.grid.grid_line_width = 2
    p.xaxis.major_label_text_font_size="12pt"
    
    show(p)
    
    end_time = time()
    print("Total time taken for Bokeh is {0}".format(end_time - start_time))
    

    It's a fair amount of code, but it would be simple enough to wrap up into a reusable function. For me, the above resulted in:

    [image of the result]
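As a rough sketch of the reusable function mentioned above, the same steps could be split into a statistics helper and a plotting helper. The names `box_stats`, `box_plot` and `box_width` are made up for this illustration and are not part of Bokeh or pandas. Like the code above, this draws only boxes, stems and whisker caps, not individual outlier points.

```python
import numpy as np
import pandas as pd


def box_stats(df, values, label):
    """Per-category quartiles and whisker ends (1.5 * IQR rule).

    Hypothetical helper: `values` and `label` are column names in `df`.
    """
    groups = df.groupby(label)[values]
    stats = pd.DataFrame({
        "q1": groups.quantile(0.25),
        "q2": groups.quantile(0.50),
        "q3": groups.quantile(0.75),
    })
    iqr = stats["q3"] - stats["q1"]
    # Whiskers never extend past the observed minimum or maximum of a group.
    stats["upper"] = np.minimum(stats["q3"] + 1.5 * iqr, groups.max())
    stats["lower"] = np.maximum(stats["q1"] - 1.5 * iqr, groups.min())
    return stats


def box_plot(df, values, label, box_width=0.7):
    """Build a Bokeh figure from box_stats(); hypothetical helper."""
    # Imported here so box_stats() can be used without Bokeh installed.
    from bokeh.plotting import figure

    stats = box_stats(df, values, label)
    cats = [str(c) for c in stats.index]  # categorical x-axis needs strings

    p = figure(x_range=cats)

    # stems
    p.segment(cats, stats["upper"], cats, stats["q3"], line_color="black")
    p.segment(cats, stats["lower"], cats, stats["q1"], line_color="black")

    # boxes (upper and lower halves, split at the median)
    p.vbar(x=cats, width=box_width, top=stats["q3"], bottom=stats["q2"],
           fill_color="#E08E79", line_color="black")
    p.vbar(x=cats, width=box_width, top=stats["q2"], bottom=stats["q1"],
           fill_color="#3B8686", line_color="black")

    # whisker caps (almost-0 height rects, as in the code above)
    p.rect(cats, stats["lower"], 0.2, 0.01, line_color="black")
    p.rect(cats, stats["upper"], 0.2, 0.01, line_color="black")
    return p
```

With the DataFrame from the code above, this would be used as `show(box_plot(df, 'Vals', 'Class'))` after calling `output_notebook()`. Keeping the statistics separate from the drawing makes it easy to time the two parts on their own.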