I'm trying really hard to fit a Gaussian mixture with sklearn, but I think I'm missing something because it definitely doesn't work.
My original data look like this:
Genotype LogRatio Strength
AB 0.392805 10.625016
AA 1.922468 10.765716
AB 0.22074 10.405445
BB -0.059783 10.625016
I want to do a Gaussian Mixture with 3 components = 3 genotypes (AA|AB|BB). I know the weight of each genotype, the mean of Log Ratio for each genotype and the mean of Strength for each genotype.
wgts = [0.8,0.19,0.01] # weight of AA,AB,BB
means = [[-0.5,9],[0.5,9],[1.5,9]] # mean(LogRatio), mean(Strength) for AA,AB,BB
I keep columns LogRatio and Strength and create a NumPy array.
datas = [[ 0.392805 10.625016]
[ 1.922468 10.765716]
[ 0.22074 10.405445]
[ -0.059783 9.798655]]
Then I tested the GaussianMixture class from sklearn.mixture in sklearn v0.18, and also tried the GMM class from sklearn v0.17 (I still don't see the difference and don't know which one to use).
gmm = mixture.GMM(n_components=3)
OR
gmm = mixture.GaussianMixture(n_components=3)
gmm.fit(datas)
colors = ['r' if i==0 else 'b' if i==1 else 'g' for i in gmm.predict(datas)]
ax = plt.gca()
ax.scatter(datas[:,0], datas[:,1], c=colors, alpha=0.8)
plt.show()
This is what I obtain, and it is a good result, but it changes every time because the initial parameters are computed differently on each run.
I would like to initialize my parameters in the GaussianMixture or GMM function, but I don't understand how I have to format my data :(
It is possible to control the randomness, and so make the results reproducible, by explicitly seeding the random_state pseudo-random number generator.
Instead of:
gmm = mixture.GaussianMixture(n_components=3)
Do:
gmm = mixture.GaussianMixture(n_components=3, random_state=3)
random_state should be an int here (a RandomState instance is also accepted): I've arbitrarily set it to 3, but you can choose any other integer. When running multiple times with the same random_state, you will get the same results.
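To check this yourself, here is a minimal sketch. I don't have your full dataset, so the `datas` array below is synthetic (three small clusters drawn around the means from your question), which is purely an assumption for illustration. It fits the model twice with the same seed and compares the predicted labels. Since you also asked about initializing the parameters: `GaussianMixture` (v0.18 and later) accepts `weights_init` and `means_init`, so the last part passes your `wgts` and `means` as the starting point for EM. These are only initial values, the fit will still update them.

```python
import numpy as np
from sklearn import mixture

# Synthetic stand-in for the real data (an assumption, not the original dataset):
# 50 points around each of the three means given in the question.
rng = np.random.RandomState(0)
means = np.array([[-0.5, 9.0], [0.5, 9.0], [1.5, 9.0]])
datas = np.vstack([m + 0.1 * rng.randn(50, 2) for m in means])

# Same integer seed on both fits, so the initialization is identical.
gmm_a = mixture.GaussianMixture(n_components=3, random_state=3).fit(datas)
gmm_b = mixture.GaussianMixture(n_components=3, random_state=3).fit(datas)
same_labels = np.array_equal(gmm_a.predict(datas), gmm_b.predict(datas))
print("identical labels across runs:", same_labels)

# Optionally start EM from the known weights and means.
# weights_init must have n_components entries summing to 1;
# means_init must have shape (n_components, n_features).
wgts = [0.8, 0.19, 0.01]
gmm_init = mixture.GaussianMixture(
    n_components=3,
    weights_init=wgts,
    means_init=means,
    random_state=3,
).fit(datas)
init_labels = gmm_init.predict(datas)
print("fitted means:", gmm_init.means_)
```

With `means_init` given, component 0 starts at your first mean, component 1 at the second, and so on, which should also help keep the colour assigned to each genotype stable between runs.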