python matplotlib seaborn histogram central-limit-theorem

Overlapping Histograms in Seaborn


I’m plotting two histograms using Seaborn’s histplot function. The first histogram represents my entire dataset, while the second one is a subset of the first. However, the second histogram doesn’t seem to overlap with the first one as expected. Here’s the code I’m using:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

data = np.sin(np.arange(0, 6*np.pi, 0.1)) * 100
sns.scatterplot(x=[np.mean(data)], y=[0])
sns.lineplot(data)

population_size = 10000
sample_size = 100
total_means = []
for _ in range(population_size):
    # draw one sample of size sample_size (with replacement) and store its mean
    total_means.append(np.mean(np.random.choice(data, sample_size)))

total_means = np.array(total_means)
sns.histplot(total_means, kde=True)

# Q. Find the interval in which 68% of the sample means will lie
z1 = norm.ppf(.50 - .68/2)
se = np.array(data).std() / sample_size ** .5
x1 = z1 * se + np.array(data).mean()
z2 = norm.ppf(.50 + .68/2)
x2 = z2 * se + np.array(data).mean()
print(x1, x2)

plt.xticks(np.arange(total_means.min(), total_means.max(), 10))
plt.xticks(np.arange(0, 500, 100))
sns.histplot(total_means, kde=True)
sns.histplot(total_means[(total_means >= x1) & (total_means <= x2)], kde=True, color='r')

On Stack Overflow, it’s recommended to avoid posting the complete code. However, I have some data that could be used to quickly address the issue without needing to generate new data.

In my code, the last two lines plot two histograms. However, from the resulting graph, it’s clear that these histograms do not overlap as expected.

sns.histplot(total_means, kde=True)
sns.histplot(total_means[(total_means >= x1) & (total_means <= x2)], kde=True, color='r')

[Image: Histogram graph]


Solution

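  • The problem is that the red plot uses narrower bins. Each call to sns.histplot computes its own default bins from the data it receives (via np.histogram_bin_edges()), and the subset spans a smaller range than the full dataset. Narrower bins mean fewer values fall into each bin, so the red bars end up lower than the blue ones.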

    Here is some code to show the difference between using default bins and using the same bins for both histograms. Even with the same bins, the first and last bars of the subset are lower than the corresponding bars of the full histogram, because the bin edges don't coincide with the limits of the subset.

    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    
    total_means = np.random.randn(2000)
    
    fig, (ax1, ax2) = plt.subplots(ncols=2)
    
    sns.histplot(total_means, kde=True, ax=ax1)
    sns.histplot(total_means[(total_means >= -1) & (total_means <= 1)], kde=True, ax=ax1)
    ax1.set_title('default bins')
    
    bins = np.linspace(total_means.min(), total_means.max(), 30)
    sns.histplot(total_means, bins=bins, kde=True, ax=ax2)
    sns.histplot(total_means[(total_means >= -1) & (total_means <= 1)], bins=bins, kde=True, ax=ax2)
    ax2.set_title('same bins for both')
    
    plt.show()
    

    [Image: sns.histplot with same bins]
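To see the bin-width difference numerically rather than visually, here is a minimal sketch using NumPy only. It assumes that seaborn's default corresponds to `bins="auto"` in `np.histogram_bin_edges()`. The seeded generator and the `[-1, 1]` subset limits are illustrative choices, not values from the question.

```python
import numpy as np

# Illustrative data: 2000 standard-normal values and the subset within [-1, 1].
rng = np.random.default_rng(0)
total_means = rng.standard_normal(2000)
subset = total_means[(total_means >= -1) & (total_means <= 1)]

# Bin edges chosen independently for each array (assumed default rule: "auto").
full_edges = np.histogram_bin_edges(total_means, bins="auto")
subset_edges = np.histogram_bin_edges(subset, bins="auto")

full_width = full_edges[1] - full_edges[0]
subset_width = subset_edges[1] - subset_edges[0]
print(f"bin width, full data: {full_width:.3f}")
print(f"bin width, subset:    {subset_width:.3f}")

# With shared edges, every subset count is bounded by the full-data count
# in the same bin, so the two histograms nest as expected.
shared_edges = np.linspace(total_means.min(), total_means.max(), 30)
full_counts, _ = np.histogram(total_means, bins=shared_edges)
subset_counts, _ = np.histogram(subset, bins=shared_edges)
print("subset never exceeds full data per bin:",
      bool(np.all(subset_counts <= full_counts)))
```

The subset's bins come out narrower because its range is smaller, which is why its bars look too low when plotted with default bins. Passing the same `bins` array to both `sns.histplot` calls removes that mismatch.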