python plotly histogram

Plotly Histogram scaled by the total number of data


I want to compare the values of two datasets, but one of them contains far more data than the other, so I want to normalize each histogram by the size of its dataset. Example code is below.

import numpy as np
import pandas as pd
import plotly.graph_objects as go

df_rand_A = pd.DataFrame(np.random.randn(200, 1), columns=['value'])
df_rand_A['kind'] = 'A'
df_rand_B = pd.DataFrame(np.random.randn(2000, 1), columns=['value'])
df_rand_B['kind'] = 'B'
df_rand = pd.concat([df_rand_A, df_rand_B])

fig = go.Figure()
df_grouped = df_rand.groupby('kind')
for kind, group in df_grouped:
    fig.add_trace(go.Histogram(x=group.value, name=kind, showlegend=True, bingroup=1))

fig.update_traces(opacity=0.75)
fig.update_layout(
    barmode="overlay",
    bargap=0,
    margin=dict(l=20, r=20, t=20, b=20),
    font=dict(size=13),
    width=800,
    height=500,
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
)

fig.show()

Solution

  • Use a plotly distplot to plot a distribution - https://plotly.com/python/distplot/

    Alternatively, compute each histogram with np.histogram and divide the bar heights by their sum, so that the bars of each histogram add up to 1 regardless of how many samples the dataset contains.

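    import plotly.graph_objects as go
    import numpy as np
    import plotly.io as pio

    pio.renderers.default = 'browser'

    data1 = np.random.normal(0, 1, 100000)
    data2 = np.random.normal(0, 1, 1000000)

    # Use the same bin edges for both datasets so the bars are comparable
    bins = np.histogram_bin_edges(np.concatenate([data1, data2]), bins=10)
    counts1, _ = np.histogram(data1, bins=bins)
    counts2, _ = np.histogram(data2, bins=bins)

    # np.histogram returns one more edge than counts, so plot at the bin centers
    centers = (bins[:-1] + bins[1:]) / 2

    fig = go.Figure()
    # Dividing by the total count makes each set of bars sum to 1
    fig.add_trace(go.Bar(x=centers, y=counts1 / counts1.sum()))
    fig.add_trace(go.Bar(x=centers, y=counts2 / counts2.sum()))
    fig.show()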
    

    Result: Normalized histogram/distplot