
Why does a Gaussian Mixture Model make different clusters each run?


I'm using Python to cluster a 5D data set, and each run generates a different set of clusters. I'm curious why this happens.

Here's the code:

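    from collections import Counter

    import numpy as np
    import pandas as pd
    from sklearn.mixture import GaussianMixture
    from sklearn.neighbors import NearestCentroid

    df = pd.read_csv('database.csv')
    ratios = df.drop(['patient', 'class'], axis=1)  # keep only the feature columns

    # Fit a 7-component mixture and assign each row to a cluster
    gaussian = GaussianMixture(n_components=7).fit(ratios).predict(ratios)
    df['gaussian'] = gaussian

    cluster_counts = Counter(df['gaussian'])
    centroids = NearestCentroid().fit(ratios, gaussian).centroids_
    sum_of_distances = np.zeros((len(centroids), 5))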

Here's a graph showing the sum of the average distances to the centroid for one run:

[first graph]

And here's a graph for another run:

[second graph]

You can see that the bar for the Gaussian mixture varies from one run to the next, whereas none of the other clustering algorithms change.

If someone could explain why this happens, it would be much appreciated.


Solution

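  • GaussianMixture documentation: the parameter you are interested in is random_state. Each time you fit the model, the initialization of the parameters may differ, and the EM algorithm can then converge to a different local optimum, which gives different clusters.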

    random_state: int, RandomState instance or None, default=None. Controls the random seed given to the method chosen to initialize the parameters (see init_params). In addition, it controls the generation of random samples from the fitted distribution (see the method sample). Pass an int for reproducible output across multiple function calls.

    More about random numbers and seeds in Python: random.seed(): What does it do?
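
As a minimal sketch of the fix, the snippet below fits the same mixture twice with a fixed random_state and twice without one. The synthetic array X is only a stand-in for the ratios table from the question, and the helper fit_labels and the seed value 42 are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the 5D `ratios` table (assumption): 7 blobs of 50 points.
rng = np.random.default_rng(0)
centers = rng.normal(scale=6.0, size=(7, 5))
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])


def fit_labels(seed):
    """Fit a 7-component mixture and return the cluster label of each row."""
    model = GaussianMixture(n_components=7, random_state=seed)
    return model.fit(X).predict(X)


# Same integer seed: identical initialization, so identical labels.
labels_a = fit_labels(42)
labels_b = fit_labels(42)
same_seed_match = np.array_equal(labels_a, labels_b)
print("same seed, same labels:", same_seed_match)

# random_state=None (the default): a fresh initialization on every fit,
# so the labels may or may not agree between runs.
labels_c = fit_labels(None)
labels_d = fit_labels(None)
print("no seed, same labels:", np.array_equal(labels_c, labels_d))
```

In the question's code this would amount to writing GaussianMixture(n_components=7, random_state=42). Note that fixing the seed only makes the result repeatable; it does not make it the best possible fit. If that matters, GaussianMixture also accepts n_init, which runs several initializations and keeps the best one.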