Tags: python, numpy, histogram, probability, histogram2d

Problems with computing the joint probability mass function with np.histogram2d


I currently have a 4024 by 10 array, where column 0 holds the 4024 returns of stock 1, column 1 the 4024 returns of stock 2, and so on. It is for an assignment for my master's in which I'm asked to compute the entropy and joint entropy of the different random variables (each random variable being the returns of one stock). Both entropy calculations require P(x) and P(x,y). So far I've managed to compute the individual empirical probabilities using the following code:

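import numpy as np
import pandas as pd

def entropy(ret, t, T, a, n):
    # Load the returns and keep rows t..T-1
    returns = pd.read_excel(ret)
    returns_df = returns.iloc[t:T, :]
    returns_mat = returns_df.to_numpy()  # as_matrix() was removed in pandas 1.0
    asset_returns = returns_mat[:, a]
    # Bin the returns and normalise the counts into empirical probabilities
    hist, bins = np.histogram(asset_returns, bins=n)
    empirical_prob = hist / hist.sum()
    entropy_vector = np.empty(len(empirical_prob))

    # Empty bins contribute 0 (the limit of -p*log2(p) as p -> 0)
    for i in range(len(empirical_prob)):
        if empirical_prob[i] == 0:
            entropy_vector[i] = 0
        else:
            entropy_vector[i] = -empirical_prob[i] * np.log2(empirical_prob[i])

    shannon_entropy = np.sum(entropy_vector)

    return shannon_entropy, empirical_prob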

P.S. ignore the whole entropy part of the code

As you can see, I've simply computed the 1-D histogram and divided each count by the total count to get the individual probabilities. However, I'm really struggling with how to compute P(x,y) using

np.histogram2d()

Now, obviously P(x,y)=P(x)*P(y) if the random variables are independent, but in my case they are not: these stocks belong to the same index and therefore possess some positive correlation, i.e. they're dependent, so the joint probability is not the product of the two individual probabilities. I've tried following the suggestion of my professor, who said:

"We had discussed how to get the empirical pdf for a univariate distribution: one defines the bins and then counts simply how many observations are in the respective bin (relative to the total number of observations). For bivariate distributions you can do the same, but now you make 2-dimensional binning (check for example the histogram2 command in matlab)"

As you can see, he's referring to the 2-D histogram function of MATLAB, but I've decided to do this assignment in Python, and so far I've written the following code:

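def jointentropy(ret, t, T, a, b, n):
    # Load the returns and keep rows t..T-1
    returns = pd.read_excel(ret)
    returns_df = returns.iloc[t:T, :]
    returns_mat = returns_df.to_numpy()  # as_matrix() was removed in pandas 1.0
    assetA = returns_mat[:, a]
    assetB = returns_mat[:, b]
    # 2-D binning of the two return series
    hist, bins1, bins2 = np.histogram2d(assetA, assetB, bins=n)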

But I don't know what to do from here, because

np.histogram2d()

returns a 4025 by 4025 array as well as the two separate bins, so I don't know what I can do to compute P(x,y) for my two dependent random variables.

I've tried to figure this out for hours without any luck or success, so any kind of help would be highly appreciated! Thank you very much in advance!


Solution

  • Looks like you've got a clear case of conditional probability on your hands. You can look it up, for example, here: http://www.mathgoodies.com/lessons/vol6/dependent_events.html. For dependent events, the probability of both occurring is P(x,y) = P(y) · P(x|y), where P(x|y) is the "probability of event x given y" (equivalently, P(x,y) = P(x) · P(y|x)). This should apply in your situation because two stocks from the same index are correlated, so the outcome of one carries information about the other. Bin each series like you did for one, then count how many observations fall into each pair of bins to get the probabilities.
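As a rough illustration of the 2-D binning idea, here is a minimal sketch, not a definitive implementation. It assumes the same normalisation as in the 1-D case (counts divided by the total count). The helper names joint_pmf and entropy_bits and the synthetic data are made up for this example. With an integer bins=n, np.histogram2d returns an n by n array of counts; dividing it by its sum gives an empirical P(x,y), and summing over rows or columns recovers the marginals.

```python
import numpy as np

def joint_pmf(x, y, bins=10):
    """Empirical joint PMF of two samples via 2-D binning (hypothetical helper)."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    return counts / counts.sum(), x_edges, y_edges

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability cells are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Synthetic, positively correlated "returns" (illustrative data only).
rng = np.random.default_rng(0)
a = rng.normal(size=4024)
b = 0.6 * a + 0.8 * rng.normal(size=4024)

p_xy, _, _ = joint_pmf(a, b, bins=20)   # shape (20, 20), sums to 1
p_x = p_xy.sum(axis=1)                  # marginal of the first variable
p_y = p_xy.sum(axis=0)                  # marginal of the second variable

# Conditional P(x|y) = P(x,y) / P(y); columns with P(y) = 0 are left at 0.
p_x_given_y = np.divide(p_xy, p_y, out=np.zeros_like(p_xy), where=p_y > 0)

h_xy = entropy_bits(p_xy)               # joint entropy H(X,Y)
h_x = entropy_bits(p_x)
h_y = entropy_bits(p_y)
mutual_info = h_x + h_y - h_xy          # plug-in estimate of I(X;Y)
```

The estimates depend on the number of bins, so it is probably worth trying a few values of bins before relying on the numbers.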