python numpy random seaborn normal-distribution

Seaborn data visualization misunderstanding of densities?


I was playing around with the seaborn library for data visualization and trying to display a standard normal distribution. The basics in this case look something like:

import numpy as np
import seaborn as sns

n = 1000
N = np.random.randn(n)
fig = sns.displot(N, kind="kde")

This behaves as expected. My problem starts when I try to plot multiple distributions at the same time. I tried the brute-force approach, N2 = np.random.randn(n // 2) and fig = sns.displot((N, N2), kind="kde"), which returns two distributions (as wanted), but the curve for the smaller sample is significantly different (and flatter). Regardless of the sample size, a proper density plot (or histogram) should have an area under the curve equal to one, but this is clearly not the case here.

Knowing that seaborn works with pandas DataFrames, I tried the more elaborate (and generally bad and inefficient, but I hope clear) code below to plot multiple distributions on the same graph again:

import numpy as np
import seaborn as sns
import pandas as pd
n=10000

N_1= np.reshape(np.random.randn(n),(n,1))
N_2= np.reshape(np.random.randn(int(n/2)),(int(n/2),1))
N_3= np.reshape(np.random.randn(int(n/4)),(int(n/4),1))

A_1 = np.reshape(np.array(['n1' for _ in range(n)]),(n,1))
A_2 = np.reshape(np.array(['n2' for _ in range(int(n/2))]),(int(n/2),1))
A_3 = np.reshape(np.array(['n3' for _ in range(int(n/4))]),(int(n/4),1))

F_1=np.concatenate((N_1,A_1),1)
F_2=np.concatenate((N_2,A_2),1)
F_3=np.concatenate((N_3,A_3),1)

F= pd.DataFrame(data=np.concatenate((F_1,F_2,F_3),0),columns=["datar","cat"])
F["datar"]=F.datar.astype('float')
fig=sns.displot(F,x="datar",hue="cat",kind="kde")

This again shows very different (almost scaled) distributions, confirming that the result is not consistent with what I was expecting (namely, roughly overlapping distributions). Am I misunderstanding how this graph works? Is there a completely different approach to drawing multiple distributions on the same graph that I am missing?


Solution

  • Seaborn works happily with and without dataframes. Columns of dataframes get converted to numpy arrays in order to draw the plots.

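    sns.displot(..., kind="kde") delegates to sns.kdeplot(), which has a parameter common_norm defaulting to True. With that default, each curve is scaled by its group's share of the total number of observations, so the areas under all curves together sum to one and a smaller sample produces a proportionally lower curve. Setting common_norm=False normalizes each curve independently, so each one has an area of one.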

    import numpy as np
    import seaborn as sns
    from matplotlib import pyplot as plt
    
    n = 10000
    
    N_1 = np.random.randn(n)
    N_2 = np.random.randn(n // 2) + 2
    N_3 = np.random.randn(n // 4) + 4
    
    sns.displot((N_1, N_2, N_3), kind="kde", common_norm=False)
    plt.show()
    

    [Figure: resulting plot]
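To make the effect of common_norm concrete, here is a minimal numerical sketch that does not call seaborn at all. It assumes that common normalization amounts to multiplying each group's density by its share of the total sample size; the helper names gaussian_kde and area, the sample sizes, and the Scott's-rule bandwidth are illustrative choices, not seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three standard-normal samples of different sizes (sizes are illustrative).
groups = {
    "n1": rng.standard_normal(4000),
    "n2": rng.standard_normal(2000),
    "n3": rng.standard_normal(1000),
}
grid = np.linspace(-8.0, 8.0, 401)


def gaussian_kde(samples, grid):
    """Plain Gaussian KDE with a Scott's-rule bandwidth, evaluated on grid."""
    bandwidth = samples.std(ddof=1) * samples.size ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z**2).sum(axis=1)
    return kernel_sum / (samples.size * bandwidth * np.sqrt(2 * np.pi))


def area(y, x):
    """Trapezoidal area under y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))


total = sum(s.size for s in groups.values())

# Independent normalization: every curve integrates to about 1.
independent = {k: area(gaussian_kde(s, grid), grid) for k, s in groups.items()}

# Common normalization (assumed model): scale each curve by n_group / n_total.
common = {
    k: area(gaussian_kde(s, grid) * s.size / total, grid)
    for k, s in groups.items()
}

for name in groups:
    print(f"{name}: independent={independent[name]:.3f}  common={common[name]:.3f}")
```

Under these assumptions the independent areas all come out close to one, while the commonly normalized areas are close to each group's share of the observations and add up to one, which matches the "scaled" curves described in the question.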

    Note that for kdeplot, having common_norm default to True makes sense: if you want independently normalized curves, you can also draw them with three separate kdeplot calls, each of which is normalized on its own. There is also a useful option multiple (defaulting to 'layer'), which can be set to 'stack' or 'fill'.

    [Figure: comparing plot]