Search code examples
pythonhistogramprobabilityestimation

Histogram based probability density estimation


How do I make a histogram based probability density estimate of each of the marginal distributions p(x1 ) and p(x2 ) of this data set:

import numpy as np
import matplotlib.pyplot as plt
linalg = np.linalg

N = 100
mean = [1,1]
cov = [[0.3, 0.2],[0.2, 0.2]]
data = np.random.multivariate_normal(mean, cov, N)
L = linalg.cholesky(cov)
# print(L.shape)
# (2, 2)
uncorrelated = np.random.standard_normal((2,N))
data2 = np.dot(L,uncorrelated) + np.array(mean).reshape(2,1)
# print(data2.shape)
# (2, 1000)
plt.scatter(data2[0,:], data2[1,:], c='green')    
plt.scatter(data[:,0], data[:,1], c='yellow')
plt.show()

For this you can use the hist function in Matlab or R. How does changing the bin width (or equivalently, the number of bins) affect the plot and the estimate of p(x1 ) and p(x2 )?

I'm using Python, is there something similar to the hist function from Matlab and how to implement it?


Solution

  • The Matlab hist function is implemented in matplotlib as (you guessed it) matplotlib.pyplot.hist. It plots the histogram, taking the number of bins as a parameter. To calculate a histogram without plotting it, use Numpy's numpy.histogram function.

    To estimate a probability distribution, you can use the distributions in scipy.stats. You generated the data above from a normal distribution. To fit a normal distribution to this data you would use scipy.stats.norm.fit. Below is a code example that plots histograms of your data and fits normal distribution to it:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm
    linalg = np.linalg
    
    N = 100
    mean = [1,1]
    cov = [[0.3, 0.2],[0.2, 0.2]]
    data = np.random.multivariate_normal(mean, cov, N)
    L = linalg.cholesky(cov)
    # print(L.shape)
    # (2, 2)
    uncorrelated = np.random.standard_normal((2,N))
    data2 = np.dot(L,uncorrelated) + np.array(mean).reshape(2,1)
    # print(data2.shape)
    # (2, 1000)
    plt.figure()
    plt.scatter(data2[0,:], data2[1,:], c='green')    
    plt.scatter(data[:,0], data[:,1], c='yellow')
    plt.show()
    
    # Plotting histograms and fitting normal distributions
    plt.subplot(211)
    plt.hist(data[:,0], bins=20, normed=1, alpha=0.5, color='green')
    plt.hist(data2[0,:], bins=20, normed=1, alpha=0.5, color='yellow')
    x = np.arange(-1, 3, 0.001)
    plt.plot(x, norm.pdf(x, *norm.fit(data[:,0])), color='green')
    plt.plot(x, norm.pdf(x, *norm.fit(data2[0,:])), color='yellow')
    plt.title('Var 1')
    
    plt.subplot(212)
    plt.hist(data[:,1], bins=20, normed=1, alpha=0.5, color='green')
    plt.hist(data2[1,:], bins=20, normed=1, alpha=0.5, color='yellow')
    x = np.arange(-1, 3, 0.001)
    plt.plot(x, norm.pdf(x, *norm.fit(data[:,1])), color='green')
    plt.plot(x, norm.pdf(x, *norm.fit(data2[1,:])), color='yellow')
    plt.title('Var 2')
    
    plt.tight_layout()