I'm using Python to cluster a 5-D data set, and each run generates a different set of clusters. I'm curious why this happens.
Here's the code:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestCentroid

df = pd.read_csv('database.csv')
ratios = df.drop(['patient', 'class'], axis=1)

# Fit a 7-component Gaussian mixture and record the cluster labels
gaussian = GaussianMixture(n_components=7).fit(ratios).predict(ratios)
df['gaussian'] = gaussian
cluster_counts = Counter(df['gaussian'])

# Per-cluster centroids derived from the fitted labels
centroids = NearestCentroid().fit(ratios, gaussian).centroids_
sum_of_distances = np.zeros((len(centroids), 5))
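The distance computation behind the graphs isn't shown above; here is a minimal sketch of how the plotted quantity could be computed, assuming Euclidean distance and using synthetic 5-D data in place of `database.csv` (the data and variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestCentroid

# Synthetic 5-D data standing in for the ratios DataFrame (hypothetical)
X, _ = make_blobs(n_samples=300, n_features=5, centers=7, random_state=0)

labels = GaussianMixture(n_components=7).fit(X).predict(X)
# NearestCentroid orders centroids_ by sorted unique labels (classes_)
centroids = NearestCentroid().fit(X, labels).centroids_

# Average Euclidean distance from each point to its cluster centroid,
# then summed over clusters -- the quantity plotted in the graphs
uniq = np.unique(labels)
avg_dists = np.array([
    np.linalg.norm(X[labels == lab] - centroids[i], axis=1).mean()
    for i, lab in enumerate(uniq)
])
total = avg_dists.sum()
```

Because the mixture is refit without a fixed seed, `total` can differ between runs, which is exactly the effect visible in the graphs.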
Here's a graph showing the sum of the average distances to the centroid for one run:
And here's a graph for another run:
You can see that the bar for the Gaussian mixture varies from one run to the next, while the bars for the other clustering algorithms stay the same.
If someone could explain why this happens it would be much appreciated.
GaussianMixture documentation
You are interested in the random_state parameter. Each time you run the model, the initialization of the parameters may differ.
random_state: int, RandomState instance or None, default=None Controls the random seed given to the method chosen to initialize the parameters (see init_params). In addition, it controls the generation of random samples from the fitted distribution (see the method sample). Pass an int for reproducible output across multiple function calls.
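A minimal sketch of the fix, using synthetic data in place of your DataFrame (the data here is illustrative): passing the same integer random_state on every fit makes the initialization, and therefore the resulting labels, reproducible.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 5-D data standing in for the ratios DataFrame (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Without random_state, each fit may start from different initial parameters;
# fixing it makes repeated fits start identically.
labels_a = GaussianMixture(n_components=7, random_state=0).fit_predict(X)
labels_b = GaussianMixture(n_components=7, random_state=0).fit_predict(X)
assert (labels_a == labels_b).all()  # identical labels across runs
```

Alternatively, increasing n_init (default 1) runs several initializations and keeps the best-scoring fit, which reduces run-to-run variance even without a fixed seed.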
More about randomness and seeding in Python: random.seed(): What does it do?