How to compute the variances in Expectation Maximization with n dimensions?


I have been reviewing Expectation Maximization (EM) in research papers such as this one:

http://pdf.aminer.org/000/221/588/fuzzy_k_means_clustering_with_crisp_regions.pdf

There are a few points I have not been able to figure out. For example, what happens when each datapoint has many dimensions?

For example, I have the following dataset with 6 datapoints and 4 dimensions:

D1  D2  D3  D4
5,  19, 72, 5
6,  18, 14, 1
7,  22, 29, 4
3,  22, 51, 1
2,  21, 89, 2
1,  12, 28, 1

Does this mean that, to compute the expectation step, I need to compute 4 standard deviations (one for each dimension)?

Do I also have to compute the variance for each cluster, assuming k=3 (I do not know whether this is necessary based on the formula from the paper), or just the variance for each dimension (4 attributes)?


Solution

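  • Usually, you use a covariance matrix, whose diagonal holds the per-dimension variances.

    But it really depends on the model you choose. The simplest model does not use variances at all. A slightly more complex model has a single variance value, the average variance over all dimensions. Next, you can have a separate variance for each dimension, treated independently (a diagonal covariance matrix). Finally, you can use a full covariance matrix, which is probably the most flexible GMM in popular use. Each of these can either be shared by all clusters or estimated separately for each cluster, which is what "equal" and "varying" refer to in the mclust list below.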

    Depending on your implementation, there can be many more.

    From R's mclust documentation:

    univariate mixture

    "E" = equal variance (one-dimensional)
    "V" = variable variance (one-dimensional)

    multivariate mixture

    "EII" = spherical, equal volume
    "VII" = spherical, unequal volume
    "EEI" = diagonal, equal volume and shape
    "VEI" = diagonal, varying volume, equal shape
    "EVI" = diagonal, equal volume, varying shape
    "VVI" = diagonal, varying volume and shape
    "EEE" = ellipsoidal, equal volume, shape, and orientation
    "EEV" = ellipsoidal, equal volume and equal shape
    "VEV" = ellipsoidal, equal shape
    "VVV" = ellipsoidal, varying volume, shape, and orientation

    single component

    "X" = univariate normal
    "XII" = spherical multivariate normal
    "XXI" = diagonal multivariate normal
    "XXX" = ellipsoidal multivariate normal