Tags: python, statistics, covariance, gaussian-mixture

Mean and covariance of conditional distribution


I have a 10000 × 22 array (observations × features), and I fit a Gaussian mixture with a single component as follows:

import sklearn.mixture

mixture = sklearn.mixture.GaussianMixture(n_components=1, covariance_type='full').fit(my_array)
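
Since there is only one component, this is effectively just fitting a single 22-dimensional Gaussian. As a quick sanity check (assuming numpy is imported as np), the fitted parameters should essentially match the sample mean and the biased sample covariance of the data:

print(np.allclose(mixture.means_[0], my_array.mean(axis=0)))
# covariances_ is the biased sample covariance plus the small reg_covar
# regularisation sklearn adds to the diagonal, hence the tolerance:
print(np.allclose(mixture.covariances_[0],
                  np.cov(my_array, rowvar=False, bias=True), atol=1e-5))

Both checks should print True.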

Then I want to calculate the mean and the covariance of the conditional distribution of the first two features given the rest, as per equations 2.81 and 2.82 on p. 87 of Bishop's Pattern Recognition and Machine Learning. What I do is the following:

import numpy as np

covariances = mixture.covariances_  # shape = (1, 22, 22): 1 fitted component, 22x22 covariance matrix
means = mixture.means_              # shape = (1, 22): one mean per feature
dependent_data = my_array[:, 0:2]   # shape = (10000, 2)
conditional_data = my_array[:, 2:]  # shape = (10000, 20)
mu_a = means[:, 0:2]  # mu of the dependent variables
mu_b = means[:, 2:]   # mu of the independent variables
cov_aa = covariances[0, 0:2, 0:2]  # cov of the dependent vars
cov_bb = covariances[0, 2:, 2:]    # cov of the independent vars
cov_ab = covariances[0, 0:2, 2:]
cov_ba = covariances[0, 2:, 0:2]
A = conditional_data.transpose() - mu_b.transpose()
B = cov_ab.dot(np.linalg.inv(cov_bb))
conditional_mu = mu_a + B.dot(A).transpose()
conditional_cov = cov_aa - cov_ab.dot(np.linalg.inv(cov_bb)).dot(cov_ba)
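
For reference, these are the formulas from Bishop (equations 2.81 and 2.82) that the code above implements, with a standing for the first two (dependent) features and b for the remaining twenty:

\mu_{a \mid b} = \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b)
\Sigma_{a \mid b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}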

My problem is that, when calculating conditional_mu and conditional_cov, I get the following shapes:

conditional_mu.shape
(10000, 2)
conditional_cov.shape
(2, 2)

I was expecting the shape of conditional_mu to be (1, 2), because I only want the mean of the first two features conditioned on the rest. Why am I getting a mean for each observation instead?


Solution

  • Yes, that is the expected dimension.

    For each data point, the independent features are fixed, and the dependent features follow a conditional normal distribution. Each data point therefore yields a different mean for the dependent features, determined by the values of its independent features.

    Since you have 10000 data points, you get 10000 conditional means for the dependent features, one per data point. The conditional covariance, on the other hand, does not depend on the values of the independent features, which is why conditional_cov is a single (2, 2) matrix (see the sketch below).
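
    Here is a minimal, self-contained sketch of the same computation (with synthetic data standing in for your my_array) that shows the conditional mean coming out as one row per observation while the conditional covariance stays a single 2x2 matrix, and how to evaluate the conditional mean at one particular x_b:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 22))  # synthetic stand-in for your data

    mixture = GaussianMixture(n_components=1, covariance_type='full').fit(X)
    mean = mixture.means_[0]       # (22,)
    cov = mixture.covariances_[0]  # (22, 22)

    mu_a, mu_b = mean[:2], mean[2:]
    cov_aa = cov[:2, :2]
    cov_ab = cov[:2, 2:]
    cov_ba = cov[2:, :2]
    cov_bb_inv = np.linalg.inv(cov[2:, 2:])

    # The conditional covariance does not involve x_b at all, so it is a single 2x2 matrix.
    conditional_cov = cov_aa - cov_ab @ cov_bb_inv @ cov_ba

    # Conditional mean at one fixed value of the independent features: a (2,) vector.
    x_b = X[0, 2:]
    mu_given_one = mu_a + cov_ab @ cov_bb_inv @ (x_b - mu_b)

    # Conditional means for all observations at once: one row per observation.
    mu_given_all = mu_a + (X[:, 2:] - mu_b) @ (cov_ab @ cov_bb_inv).T

    print(mu_given_one.shape, mu_given_all.shape, conditional_cov.shape)
    # (2,) (10000, 2) (2, 2)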