Histogram based probability density estimation

How do I make a histogram based probability density estimate of each of the marginal distributions p(x1 ) and p(x2 ) of this data set:

import numpy as np
import matplotlib.pyplot as plt
linalg = np.linalg

N = 100
mean = [1,1]
cov = [[0.3, 0.2],[0.2, 0.2]]
data = np.random.multivariate_normal(mean, cov, N)
L = linalg.cholesky(cov)
# print(L.shape)
# (2, 2)
uncorrelated = np.random.standard_normal((2,N))
data2 = np.dot(L,uncorrelated) + np.array(mean).reshape(2,1)
# print(data2.shape)
# (2, 1000)
plt.scatter(data2[0,:], data2[1,:], c='green')    
plt.scatter(data[:,0], data[:,1], c='yellow')
plt.show()

For this you can use the hist function in Matlab or R. How does changing the bin width (or equivalently, the number of bins) affect the plot and the estimate of p(x1 ) and p(x2 )?

I'm using Python, is there something similar to the hist function from Matlab and how to implement it?

Solution

The Matlab hist function is implemented in matplotlib as (you guessed it) matplotlib.pyplot.hist. It plots the histogram, taking the number of bins as a parameter. To calculate a histogram without plotting it, use Numpy's numpy.histogram function.

To estimate a probability distribution, you can use the distributions in scipy.stats. You generated the data above from a normal distribution. To fit a normal distribution to this data you would use scipy.stats.norm.fit. Below is a code example that plots histograms of your data and fits normal distribution to it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
linalg = np.linalg

N = 100
mean = [1,1]
cov = [[0.3, 0.2],[0.2, 0.2]]
data = np.random.multivariate_normal(mean, cov, N)
L = linalg.cholesky(cov)
# print(L.shape)
# (2, 2)
uncorrelated = np.random.standard_normal((2,N))
data2 = np.dot(L,uncorrelated) + np.array(mean).reshape(2,1)
# print(data2.shape)
# (2, 1000)
plt.figure()
plt.scatter(data2[0,:], data2[1,:], c='green')    
plt.scatter(data[:,0], data[:,1], c='yellow')
plt.show()

# Plotting histograms and fitting normal distributions
plt.subplot(211)
plt.hist(data[:,0], bins=20, normed=1, alpha=0.5, color='green')
plt.hist(data2[0,:], bins=20, normed=1, alpha=0.5, color='yellow')
x = np.arange(-1, 3, 0.001)
plt.plot(x, norm.pdf(x, *norm.fit(data[:,0])), color='green')
plt.plot(x, norm.pdf(x, *norm.fit(data2[0,:])), color='yellow')
plt.title('Var 1')

plt.subplot(212)
plt.hist(data[:,1], bins=20, normed=1, alpha=0.5, color='green')
plt.hist(data2[1,:], bins=20, normed=1, alpha=0.5, color='yellow')
x = np.arange(-1, 3, 0.001)
plt.plot(x, norm.pdf(x, *norm.fit(data[:,1])), color='green')
plt.plot(x, norm.pdf(x, *norm.fit(data2[1,:])), color='yellow')
plt.title('Var 2')

plt.tight_layout()