Python: How to overlay histograms using Plotly


I have two sets of data in separate lists. Each list element is a value from 0 to 100, and values repeat.

For example:
first_data = [10,20,40,100,...,100,10,50]
second_data = [20,50,50,10,...,70,10,100]

I can plot one of these in a histogram using:

import plotly.graph_objects as go
.
.
.

fig = go.Figure()
fig.add_trace(go.Histogram(histfunc='count', x=first_data))
fig.show()

By setting histfunc to 'count', my histogram consists of an x-axis from 0 to 100 and bars for the number of repeated elements in first_data.

My question is: How can I overlay the second set of data over the same axis using the same "count" histogram?


Solution

  • One way to do this is simply to add another trace; you were nearly there! The dataset used to create these examples can be found in the last section of this post.

    Note:
    The following code uses plotly's 'lower-level', dictionary-based API. Personally, I find it more transparent: it shows what is being plotted, and why, rather than relying on the convenience modules graph_objects and express.

    Option 1 - Overlaid Bars:

    from plotly.offline import plot
    
    layout = {}
    traces = []
    
    traces.append({'x': data1, 'name': 'D1', 'opacity': 1.0})
    traces.append({'x': data2, 'name': 'D2', 'opacity': 0.5})
    
    # For each trace, add elements which are common to both.
    for t in traces:
        t.update({'type': 'histogram',
                  'histfunc': 'count',
                  'nbinsx': 50})
    
    layout['barmode'] = 'overlay'
    
    plot({'data': traces, 'layout': layout})
    

    Output 1: [figure: overlaid histograms of D1 and D2]

    Option 2 - Curve Plot:

    Another option is to plot the curve (Gaussian KDE) of each distribution. It's worth noting that this method plots the probability density rather than the counts, so its y-axis values are not on the same scale as the bars in Option 1.

    X1, Y1 = calc_curve(data1)
    X2, Y2 = calc_curve(data2)
    
    traces = []
    traces.append({'x': X1, 'y': Y1, 'name': 'D1'})
    traces.append({'x': X2, 'y': Y2, 'name': 'D2'})
    
    plot({'data': traces})
    

    Output 2: [figure: KDE curves of D1 and D2]

    Associated calc_curve() function:

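    import numpy as np
    from scipy.stats import gaussian_kde
    
    def calc_curve(data):
        """Calculate the Gaussian KDE probability density over the data's range."""
        data = np.asarray(data, dtype=float)  # also accepts plain Python lists
        # 501 evenly spaced x-values from the minimum to the maximum.
        X = np.linspace(data.min(), data.max(), 501)
        Y = gaussian_kde(data).evaluate(X)
        return X, Y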
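
    Because the curve is a probability density, it cannot be drawn directly over a 'count' histogram. One possible workaround, sketched here under assumptions, is to rescale the density by the number of samples and the bin width, which gives the approximate count expected per bin. The sample data and the bin_width of 2.0 are made up for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
data = rng.normal(50, 15, size=10_000)  # made-up sample for illustration

bin_width = 2.0  # assumed histogram bin width
x = np.linspace(data.min(), data.max(), 501)
density = gaussian_kde(data).evaluate(x)

# A density integrates to roughly 1 over the data range (trapezoidal rule).
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(x)))

# Approximate count per bin: density * number of samples * bin width.
expected_counts = density * data.size * bin_width
```

    For the rescaled curve to line up with the bars, the histogram's bin width would need to be fixed to the same value, for example through the trace's xbins size attribute, since nbinsx only sets a maximum number of bins.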
    

    Option 3 - Plot Bars and Curves:

    Or, you can combine the two methods, using the probability density on the y-axis. This reuses X1, Y1, X2 and Y2 from Option 2.

    layout = {}
    traces = []
    
    traces.append({'x': data1, 'name': 'D1', 'opacity': 1.0})
    traces.append({'x': data2, 'name': 'D2', 'opacity': 0.5})
    
    for t in traces:
        t.update({'type': 'histogram',
                  'histnorm': 'probability density',
                  'nbinsx': 50})
    
    traces.append({'x': X1, 'y': Y1, 'name': 'D1'})
    traces.append({'x': X2, 'y': Y2, 'name': 'D2'})
    
    layout['barmode'] = 'overlay'
    
    plot({'data': traces, 'layout': layout})  
    

    Output 3: [figure: overlaid histograms of D1 and D2 with their KDE curves]

    Dataset:

    Here is the code used to simulate your dataset of values in [0, 100] and to create these examples:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    
    mms = MinMaxScaler((0, 100))
    np.random.seed(4)
    data1 = mms.fit_transform(np.random.randn(10000).reshape(-1, 1)).ravel()
    data2 = mms.fit_transform(np.random.randn(10000).reshape(-1, 1)).ravel()
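
    If scikit-learn is not available, the same kind of rescaling can be sketched with NumPy alone. The helper scale_to_range is a hypothetical name introduced for this example, and because it uses a different random generator, it will not reproduce the exact numbers generated above.

```python
import numpy as np


def scale_to_range(values, low=0.0, high=100.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return low + (values - values.min()) / span * (high - low)


rng = np.random.default_rng(4)  # seed echoes the original; the numbers will differ
data1 = scale_to_range(rng.standard_normal(10_000))
data2 = scale_to_range(rng.standard_normal(10_000))
```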