
First use of PyMC fails


I am new to PyMC and would like to know why this code doesn't work. I have already spent hours on this, but I am missing something. Could anyone help me?

The question I want to address:

  • I have a set of Npts measurements that show 3 bumps, so I want to model them as the sum of 3 Gaussians (assuming the measurements are noisy and the Gaussian approximation is reasonable; the mixture density is sketched right after this list) ==> I want to estimate 8 parameters: the relative weights of the bumps (i.e. 2 params, since the 3 weights sum to 1), their 3 means and their 3 variances.

  • I want this approach to be general enough to apply to other data sets that may not show the same bumps, so I use loose flat priors.
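
In other words, the assumed density is a 3-component Gaussian mixture. As a plain-Python sketch of that model (my own illustration, using scipy.stats.norm, which the post itself does not use):

    from scipy.stats import norm

    def mixture_pdf(x, p, mu, sigma):
        # density of the assumed model: p1*N(mu1, sigma1**2) + p2*N(mu2, sigma2**2) + p3*N(mu3, sigma3**2)
        return sum(pi * norm.pdf(x, mi, si) for pi, mi, si in zip(p, mu, sigma))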

Problem: my code below gives me crappy estimates. What's wrong? Thanks!

"""
hypothesis: the multimodal distribution is the sum of 3 gaussian distributions

model description:
* p1, p2, p3 are the probabilities for a point to belong to gaussian 1, 2 or 3
 ==> p1, p2, p3 are the relative weights of the 3 gaussians

* once a point is associated with a gaussian,
it is distributed normally according to the parameters mu_i, sigma_i of the gaussian
but instead of sigma, PyMC parameterizes the normal with tau = 1/sigma**2

* thus, PyMC must guess 8 parameters: p1, p2, mu1, mu2, mu3, tau1, tau2, tau3

* priors on p1, p2 are flat between 0.1 and 0.9 ==> 'pm.Uniform' variables
with the constraint p2<=1-p1. p3 is deterministic ==1-p1-p2

* the 'assignment' variable assigns each point to a gaussian, according to probabilities p1, p2, p3

* priors on mu1, mu2, mu3 are flat between 40 and 120 ==> 'pm.Uniform' variables

* priors on sigma1, sigma2, sigma3 are flat between 4 and 12 ==> 'pm.Uniform' variables
"""

    import numpy as np
    import pymc as pm

    data = np.loadtxt('distrib.txt')
    Npts = len(data)

    mumin = 40
    mumax = 120
    sigmamin = 4
    sigmamax = 12

    p1 = pm.Uniform("p1",0.1,0.9)
    p2 = pm.Uniform("p2",0.1,1-p1)
    p3 = 1-p1-p2
    assignment = pm.Categorical('assignment',[p1,p2,p3],size=Npts)
    mu = pm.Uniform('mu',[mumin,mumin,mumin],[mumax,mumax,mumax])
    sigma = pm.Uniform('sigma',[sigmamin,sigmamin,sigmamin],
                       [sigmamax,sigmamax,sigmamax])
    tau = 1/sigma**2

    @pm.deterministic
    def assign_mu(assi=assignment, mu=mu):
        # pick each point's mean from its assigned component
        return mu[assi]

    @pm.deterministic
    def assign_tau(assi=assignment, tau=tau):
        # pick each point's precision from its assigned component
        return tau[assi]

    hypothesis = pm.Normal("obs", assign_mu, assign_tau, value=data, observed=True)

    model = pm.Model([hypothesis, p1, p2, tau, mu])
    test = pm.MCMC(model)
    test.sample(50000,burn=20000) # conservative values, let's take a coffee... 

    print('\nguess\n* p1, p2 = ',
           np.mean(test.trace('p1')[:]),' ; ',
           np.mean(test.trace('p2')[:]),' ==> p3 = ',
           1-np.mean(test.trace('p1')[:])-np.mean(test.trace('p2')[:]),
           '\n* mu = ',
           np.mean(test.trace('mu')[:,0]),' ; ',
           np.mean(test.trace('mu')[:,1]),' ; ',
           np.mean(test.trace('mu')[:,2]))

    print('why does this guess suck ???!!!')      

I can send the data file 'distrib.txt' (~500 kB); the data are plotted below. For instance, the last run gave me:

p1, p2 = 0.366913192214  ;  0.583816452532  ==> p3 = 0.04927035525400003
mu =  77.541619286  ;  75.3371615466  ;  77.2427165073

while there are obviously bumps around ~55, ~75 and ~90, with probabilities around ~0.2, ~0.5 and ~0.3

[Figure: histogram of the data (20k points == Npts)]


Solution

  • You have the problem described here: Negative Binomial Mixture in PyMC

    The problem is that the Categorical variable converges too slowly for the three component distributions to get even close to the true bumps.
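
    One way to see this slow mixing directly (a small diagnostic sketch of my own, reusing the `test` sampler from the question; the `switch_rate` name is mine):

    # the trace has shape (n_samples, Npts): one cluster label per data point per sample
    assign_trace = test.trace("assignment")[:]
    # fraction of labels that flip between successive samples; near zero means
    # the assignments are barely moving
    switch_rate = (np.diff(assign_trace, axis=0) != 0).mean()
    print("assignment switch rate:", switch_rate)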

    First, we generate your test data:

    import numpy as np

    # three Gaussian bumps: weights 0.2 / 0.5 / 0.3, means 55 / 75 / 90, sigma = 5
    data1 = np.random.normal(55, 5, 2000)
    data2 = np.random.normal(75, 5, 5000)
    data3 = np.random.normal(90, 5, 3000)
    data = np.concatenate([data1, data2, data3])
    np.savetxt("distrib.txt", data)
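
    As a quick sanity check before fitting (a minimal sketch assuming matplotlib, not part of the original answer), the raw histogram should already show the three bumps:

    import matplotlib.pyplot as plt

    plt.hist(data, bins=60)   # three overlapping modes near 55, 75 and 90
    plt.show()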
    

    Then we plot the histogram, colored by the posterior group assignment:

    import matplotlib.pyplot as plt

    # group the data points by their rounded mean posterior assignment
    tablebyassignment = [data[np.nonzero(np.round(test.trace("assignment")[:].mean(axis=0)) == i)]
                         for i in range(3)]
    plt.hist(tablebyassignment, bins=30, stacked=True)
    

    [Figure: stacked histogram with poor discrimination between the clusters]

    This will eventually converge, but not quickly enough to be useful to you.

    You can fix this problem by guessing the values of assignment before starting MCMC:

    from sklearn.cluster import KMeans

    # cluster the 1-D data into 3 groups; KMeans expects a 2-D array, hence the transpose
    kme = KMeans(3)
    kme.fit(np.atleast_2d(data).T)
    # start the Categorical at the k-means labels instead of a random assignment
    assignment = pm.Categorical('assignment', [p1, p2, p3], size=Npts, value=kme.labels_)
    

    Which gives you:

    [Figure: stacked histogram showing separation of the three lumps]

    Using k-means to initialize the categorical may not work all of the time, but it is better than not converging.
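
    Going one step further (my own extension, not part of the original answer), the same k-means fit can also seed the component means via the stochastic's initial value; `cluster_centers_` is the standard scikit-learn attribute:

    # sketch: initialize mu at the k-means centers, kept in label order so they
    # line up with the assignment initialized from kme.labels_
    centers = kme.cluster_centers_.ravel()
    mu = pm.Uniform('mu', [mumin]*3, [mumax]*3, value=centers)

    The rest of the model is then rebuilt and sampled exactly as in the question.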