Tags: python-3.x, matplotlib, plot, seaborn, density-plot

Plotting multiple density curves on the same plot: weighting the subset categories in Python 3


I am trying to recreate this density plot in Python 3: math.stackexchange.com/questions/845424/the-expected-outcome-of-a-random-game-of-chess

End Goal: I need my density plot to look like this

The area under the blue curve is equal to that of the red, green, and purple curves combined because the different outcomes (Draw, Black wins, and White wins) are subsets of the total (All).

How do I get Python to account for this and plot the curves accordingly?

Here is the .csv file of results_df after 1000 simulations pastebin.com/YDVMx2DL

from matplotlib import pyplot as plt
import seaborn as sns

black = results_df.loc[results_df['outcome'] == 'Black']
white = results_df.loc[results_df['outcome'] == 'White']
draw = results_df.loc[results_df['outcome'] == 'Draw']
win = results_df.loc[results_df['outcome'] != 'Draw']

Total = len(results_df.index)
Wins = len(win.index)

PercentBlack = "Black Wins ≈ %s" %('{0:.2%}'.format(len(black.index)/Total))
PercentWhite = "White Wins ≈ %s" %('{0:.2%}'.format(len(white.index)/Total))
PercentDraw = "Draw ≈ %s" %('{0:.2%}'.format(len(draw.index)/Total))
AllTitle = 'Distribution of Moves by All Outcomes (nSample = %s)' %(workers)

sns.distplot(results_df.moves, hist=False, label = "All")
sns.distplot(black.moves, hist=False, label=PercentBlack)
sns.distplot(white.moves, hist=False, label=PercentWhite)
sns.distplot(draw.moves, hist=False, label=PercentDraw)
plt.title(AllTitle)
plt.ylabel('Density')
plt.xlabel('Number of Moves')
plt.legend()
plt.show()

The code above produces density curves without weights. I need to figure out how to weight each curve by its subset's share of the total while preserving my labels in the legend.

density curves, no weights; help

I also tried frequency histograms, which scale the distribution heights correctly, but I would rather keep the 4 curves overlaid on top of each other for a "cleaner" look. I don't like this frequency plot, but it is my current fix.

results_df.moves.hist(alpha=0.4, bins=range(0, 700, 10), label = "All")
draw.moves.hist(alpha=0.4, bins=range(0, 700, 10), label = PercentDraw)
white.moves.hist(alpha=0.4, bins=range(0, 700, 10), label = PercentWhite)
black.moves.hist(alpha=0.4, bins=range(0, 700, 10), label = PercentBlack)
plt.title(AllTitle)
plt.ylabel('Frequency')
plt.xlabel('Number of Moves')
plt.legend()
plt.show()

If anyone can write Python 3 code that outputs the first plot, with 4 density curves carrying the correct subset weights and with the custom legend that shows percentages preserved, that would be much appreciated.

Once the density curves are plotted with the correct subset weights, I am also interested in Python 3 code for finding the coordinates of the maximum of each density curve, which shows the most frequent number of moves, once I scale up to 500,000 iterations.

Thanks


Solution

  • You need to be careful. The plot that you have produced is correct. All the curves shown are probability density functions of the underlying distributions.

    In the plot that you want to have, only the curve labeled "All" is a probability density function. The other curves are not, because each one integrates to its subset's share of the samples rather than to 1.

    In any case, you will need to calculate the kernel density estimate yourself if you want to scale it as shown in the desired plot. This can be done using scipy.stats.gaussian_kde().
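To make the distinction concrete, here is a minimal sketch that checks the areas numerically. It uses synthetic samples rather than the chess results, and the names subset, rest and share are only illustrative. The unscaled gaussian_kde curve has area 1, while the same curve multiplied by the subset's share of the samples has an area equal to that share.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for one outcome subset and the remaining games.
subset = rng.gumbel(80, 25, 1000)
rest = rng.gumbel(200, 46, 4000)
share = len(subset) / (len(subset) + len(rest))  # 0.2

kde = scipy.stats.gaussian_kde(subset)

# Area under the unscaled KDE over the whole real line.
pdf_area = kde.integrate_box_1d(-np.inf, np.inf)

# Area under the scaled curve, approximated by a Riemann sum on a fine grid.
grid = np.linspace(-200, 600, 4001)
dx = grid[1] - grid[0]
weighted_area = np.sum(share * kde(grid)) * dx

print(f"unscaled area: {pdf_area:.4f}")
print(f"scaled area:   {weighted_area:.4f} (subset share: {share:.4f})")
```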

    In order to reproduce the desired plot, I see two options.

    Option 1: Calculate the KDE for every case, including the pooled data, and scale each subset's curve by its share of the total number of samples.

    import numpy as np; np.random.seed(0)
    import matplotlib.pyplot as plt
    import scipy.stats
    
    a = np.random.gumbel(80, 25, 1000).astype(int)
    b = np.random.gumbel(200, 46, 4000).astype(int)
    
    kdea = scipy.stats.gaussian_kde(a)
    kdeb = scipy.stats.gaussian_kde(b)
    
    both = np.hstack((a,b))
    kdeboth = scipy.stats.gaussian_kde(both)
    grid = np.arange(500)
    
    #weighted kde curves
    wa = kdea(grid)*(len(a)/float(len(both)))
    wb = kdeb(grid)*(len(b)/float(len(both)))
    
    # on this unit-spaced grid, each sum approximates the area under the curve
    print("a.sum ", wa.sum())
    print("b.sum ", wb.sum())
    print("total.sum ", kdeboth(grid).sum())
    
    fig, ax = plt.subplots()
    ax.plot(grid, wa, lw=1, label = "weighted a")
    ax.plot(grid, wb, lw=1, label = "weighted b")
    ax.plot(grid, kdeboth(grid), color="crimson", lw=2, label = "pdf")
    
    plt.legend()
    plt.show()
    


    Option 2: Calculate the KDE for each individual case, scale each curve by its share of the samples, and sum the scaled curves to obtain the total.

    import numpy as np; np.random.seed(0)
    import matplotlib.pyplot as plt
    import scipy.stats
    
    a = np.random.gumbel(80, 25, 1000).astype(int)
    b = np.random.gumbel(200, 46, 4000).astype(int)
    
    kdea = scipy.stats.gaussian_kde(a)
    kdeb = scipy.stats.gaussian_kde(b)
    
    grid = np.arange(500)
    
    
    #weighted kde curves
    wa = kdea(grid)*(len(a)/float(len(a)+len(b)))
    wb = kdeb(grid)*(len(b)/float(len(a)+len(b)))
    
    total = wa+wb
    
    fig, ax = plt.subplots(figsize=(5,3))
    ax.plot(grid, wa, lw=1, label = "weighted a")
    ax.plot(grid, wb, lw=1, label = "weighted b")
    ax.plot(grid, total, color="crimson", lw=2, label = "pdf")
    
    plt.legend()
    plt.show()
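The two options do not give exactly the same "total" curve. By default gaussian_kde chooses its bandwidth from the data it is given, so a KDE fitted to the pooled samples is smoothed differently from the sum of the separately fitted, scaled curves. Only when the total is built as that sum do the subset curves add up to it at every point. A rough sketch of the comparison, reusing the same synthetic samples as the examples above:

```python
import numpy as np
import scipy.stats

np.random.seed(0)
a = np.random.gumbel(80, 25, 1000).astype(int)
b = np.random.gumbel(200, 46, 4000).astype(int)
both = np.hstack((a, b))
grid = np.arange(500)

n = len(both)
wa = scipy.stats.gaussian_kde(a)(grid) * len(a) / n
wb = scipy.stats.gaussian_kde(b)(grid) * len(b) / n

pooled = scipy.stats.gaussian_kde(both)(grid)  # total fitted to pooled data
summed = wa + wb                               # total as sum of scaled parts

# On this unit-spaced grid, each sum approximates the area under the curve.
print("pooled area:", pooled.sum())
print("summed area:", summed.sum())
# The curves enclose nearly the same area but are not identical pointwise.
print("largest pointwise gap:", np.abs(pooled - summed).max())
```

If the subset curves must stack up exactly to the "All" curve, summing the scaled curves is the safer choice. If "All" should be an independent estimate of the pooled distribution, fitting a separate KDE is more appropriate.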
    

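Applied to the setup in the question, the summed-curves approach might look like the sketch below. It rests on several assumptions: the move counts are synthetic stand-ins for results_df (the real CSV is not used), and the helpers weighted_kde_curves, curve_peak and plot_curves are hypothetical names introduced here. The legend labels follow the question's percentage format. The peak of each curve is taken as the grid point where the sampled curve is largest, which is one simple way to answer the follow-up about the maximum of each density curve.

```python
import numpy as np
import scipy.stats


def weighted_kde_curves(moves_by_outcome, grid):
    """Return {outcome: KDE scaled by that outcome's share of all games}."""
    total = sum(len(m) for m in moves_by_outcome.values())
    return {
        outcome: scipy.stats.gaussian_kde(moves)(grid) * len(moves) / total
        for outcome, moves in moves_by_outcome.items()
    }


def curve_peak(grid, curve):
    """Return (x, y) of the highest point of a curve sampled on grid."""
    i = int(np.argmax(curve))
    return grid[i], curve[i]


def plot_curves(grid, curves, labels, all_curve):
    """Overlay the 'All' curve and the scaled subset curves on one axis."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(grid, all_curve, lw=2, label="All")
    for outcome, curve in curves.items():
        ax.plot(grid, curve, lw=1, label=labels[outcome])
    ax.set_xlabel("Number of Moves")
    ax.set_ylabel("Density")
    ax.legend()
    return fig


# Synthetic stand-in data (assumption). With the real data, each array would
# come from results_df, e.g. results_df.loc[results_df['outcome'] == 'Draw', 'moves'].
rng = np.random.default_rng(0)
moves_by_outcome = {
    "Draw": rng.gumbel(200, 46, 850),
    "White": rng.gumbel(80, 25, 80),
    "Black": rng.gumbel(80, 25, 70),
}
grid = np.arange(700)

curves = weighted_kde_curves(moves_by_outcome, grid)
curves_all = sum(curves.values())  # "All" is the sum of the scaled parts

# Legend labels in the question's format, e.g. "Draw ≈ 85.00%".
total = sum(len(m) for m in moves_by_outcome.values())
prefix = {"Draw": "Draw", "White": "White Wins", "Black": "Black Wins"}
labels = {
    outcome: "{} ≈ {:.2%}".format(prefix[outcome], len(moves) / total)
    for outcome, moves in moves_by_outcome.items()
}

for outcome, curve in curves.items():
    x, y = curve_peak(grid, curve)
    print(f"{labels[outcome]}: peak at {x} moves (height {y:.5f})")
```

Calling plot_curves(grid, curves, labels, curves_all) and then matplotlib.pyplot.show() draws the figure. The peak location is only as fine as the grid spacing, so a denser grid gives a more precise x value. Evaluating gaussian_kde costs time proportional to the number of samples times the number of grid points, which is worth keeping in mind at 500,000 iterations.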