Tags: python, statistics, logistic-regression, statsmodels, statistical-test

Python package for getting the maximum likelihood estimator for logistic regression


As this post is long, I will put my questions here:

1. Is there a package in Python that will give me the maximum likelihood parameter estimates, for a given number of parameters p, given the covariates x and the data values y? (Preferably with comprehensive documentation on how to implement it.)

2. Is this method scalable, i.e. will it work if I try O(100) covariates and a similar number of data samples?

Background:

I am trying to investigate things like the distribution of the maximum likelihood estimators as the number of samples n and covariates p vary, using Python. My script generates the data for logistic regression just fine, but I have been unable to get any method of parameter estimation (i.e. the parameter values maximising the log likelihood) to work correctly.

Approaches I have tried:

-coding up my own version of the Newton-Raphson procedure, but the errors in my estimates were diverging on repeated iterations (I checked for obvious sign and inequality errors, of course!)

-using a Newton conjugate gradient implementation. This also failed because of the error 'the truth value of an array is ambiguous'. When I was coding up my own version I could use all() to cure this, but not when using a package. Strange, because I thought Newton-CG was meant to work for the multivariate case! (See the sketch below.)

-Finally, I am trying to use the package statsmodels. I don't know if I am being really obtuse, but I can't find any comprehensive documentation on this. For getting logistic regression parameters, the best I have found is this, from which I have tried e.g.

from sklearn.linear_model import LogisticRegression

X, y = logit_data(np.power(10,6), p, theta)
y = np.reshape(y, (len(y),))
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)
thetaEst = clf.get_params(X, y)

and I have tried with the last line as just:

thetaEst = clf.get_params()

But nothing seems to give me estimates for the parameter values. I either get errors or an object that I don't expect. Surely there is a Python package that can do this?!? Surely I shouldn't have to resort to R (I don't know R D: !!!)
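For reference, here is a minimal sketch (my own helper names, not from any package documentation) of how the Newton-CG attempt could be wired up with scipy.optimize.minimize. One common cause of the 'truth value of an array is ambiguous' error is the objective returning a length-1 array instead of a scalar, or y carrying a stray second dimension.

import numpy as np
from scipy.optimize import minimize

def negloglik(theta, X, y):
    #Must return a scalar, otherwise scipy's internal comparisons can fail
    h = 1/(1+np.exp(-X @ theta))
    return -np.sum(y*np.log(h) + (1-y)*np.log(1-h))

def negloglik_grad(theta, X, y):
    #Must return a 1-D array of length p+1
    h = 1/(1+np.exp(-X @ theta))
    return -X.T @ (y-h)

#Example usage (logit_data is defined in the Optional Code below):
#X, y = logit_data(np.power(10,4), p, theta)
#theta0 = np.zeros(X.shape[1])
#res = minimize(negloglik, theta0, args=(X, y.ravel()), jac=negloglik_grad, method='Newton-CG')
#print(res.x)   #maximum likelihood estimates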

Optional Code

I don't want to make this post longer, but I'm sure it will be asked for. So I have put in my code implementing Newton-Raphson, and I have been substituting existing packages for this function:

#Script for producing y data from p covariates drawn from a specified distribution, specified theta parameters,
#and n data samples, for the logit link function.

import numpy as np
import matplotlib.pyplot as plt


#Define link function here
def g(z):
    g=1/(1+np.exp(-z))
    return g

#For producing y data values given true parameters theta and number of covariates
def logit_data(n, p, theta):
    #Define parameters

    #1)Number of covariates
    p_i = p+1  #with intercept
    p_i = int(p_i)

    #2)n as correct data type
    n = int(n)

    #3)Specify parameter values to be estimated
    theta = np.reshape(theta, (p_i,1))

    #4)Define distribution from which covariate values are drawn i.i.d., and initiate data values
    X=np.zeros((n,p_i))
    X[:,0]=1    #intercept
    mean=0
    sigma=1.5

    X[:,1:]=np.random.normal(mean,sigma,(n,p))


    #5)Produce y values, treating y as a Bernoulli variable with success probability g(X*theta)
    r=np.random.uniform(0,1,n)
    r=np.reshape(r, (len(r),1))
    htrue=g(X.dot(theta))
    y=htrue-r
    y[y>=0]=1
    y[y<0]=0

    return X, y





#Newton-Raphson implementation
def NewtonRaphson(X,y):
    ##NOTE: The functions negloglikelihood, gradf and hessian all return values for f = -(log likelihood),
    #so the NR method minimises f (rather than maximising l)

    #Define log likelihood function to be maximised
    def negloglikelihood(y,h):
        l= y.transpose() @ np.log(h) + (1-y).transpose() @ np.log(1-h)
        f=-l
        return f

    #Define gradient of log likelihood function
    def gradf(y, h, X):
        a=(y-h).transpose()
        gradl= np.matmul(a,X)
        grad_f=-gradl
        return grad_f

    #Define second derivative (Hessian) of log likelihood function
    def hessian(h, X):
        D=np.identity(len(h))*(np.matmul(h,(1-h).transpose()))
        H=-X.transpose() @ D @ X
        Hf=-H
        return Hf


    #Minimise f=-l

    #Produce initial theta estimate and probability parameter h
    np.random.seed(555)
    thetaEst=np.random.normal(1.1, 0.4, 6)

    eta=np.matmul(X,thetaEst)
    h=g(eta)

    #Iterate while not at a minimum of f

    #control constants (relative and absolute tolerances for the stopping criterion)
    a=10e-8
    b=10e-8
    i=0
    j=0
    k=0

    #Stop once the gradient norm is small relative to |f|, plus an absolute floor
    while not (np.linalg.norm(gradf(y,h,X)) < np.absolute(negloglikelihood(y,h)) * a + b):
        i=i+1
        #print('i = %s' %i)
        h=g(np.matmul(X,thetaEst)) 
        H=hessian(h,X)      #Use a Cholesky decomposition below to check that the Hessian (of f) is positive definite
       # print(H)
        try:
            np.linalg.cholesky(H)
            #print('j = %s' %j)
            j=j+1
        except np.linalg.LinAlgError:
            print('Hessian not positive semi-definite!')
            try:
                v,w=np.linalg.eig(H)
               # print(v,w)
                v=np.absolute(v)
                H=w @ np.diag(v) @ np.linalg.inv(w)
            except:
                return thetaEst


        delta = 0

        try:
            delta=np.linalg.solve(H, np.reshape(gradf(y,h,X),(6,1))) #Solve for incremental theta step
            #print('k = %s' %k)
            k=k+1
        except:
            return thetaEst   #Simply return theta estimate if have singular hessian

        while negloglikelihood(y, h) > negloglikelihood(y, g(np.matmul(X,thetaEst+delta))):
            print('!!')
            delta=0.5*delta               #Ensure added step is sufficiently small so as not to diverge

        thetaEst=thetaEst+delta

    return thetaEst









#Main control

#1)Sample numbers to test for errors in theta, as powers of 10.
npowers=np.arange(1,2,0.05)
n=np.power(10,npowers)


#2)Number of independent covariates
p=5


#3)True theta to be estimated (parameter values)
theta=np.asarray([1,1.2,1.1,0.8,0.9,1.3])


#4)Initiate arrays to store estimates of theta (and errors) computed at the specified sample numbers N
Thetas=np.zeros((len(npowers),p+1))
Errors=np.zeros((len(npowers),p+1))


#5)Obtain random covariate values from the specified distribution, and corresponding Bernoulli y values
#generated using the true theta.
X,y = logit_data(n[-1],p,theta)


#6)Calculate theta estimates from the first N samples, for each of the given n values
for ind,N in enumerate(n):
    N=int(N)
    thetaTemp=NewtonRaphson(X[0:N,:],y[0:N])
    Thetas[ind,:] = np.reshape(thetaTemp,6)



#7)Calculate true errors
#print(Thetas)
Errors=Thetas-theta.transpose()
absError=np.abs(Errors)
nerror=Errors*np.sqrt(n)[:,np.newaxis]
logerror = np.log10(absError)

#8)Save data as csv


#9)Plots

plt.scatter(X@theta, g(X@theta))
plt.scatter(X@theta,y)
plt.show()



fig=plt.figure()
for i in range(p+1):
    plt.plot(npowers, logerror[:,i])

fig.suptitle('log10(Absolute Error) in MLE against log10(Number of samples,N) for logistic regression')
plt.xlabel('log_10(N)')
plt.ylabel('log_10(Absolute Error)')

fig.savefig('logiterrors7.png')
plt.show()

Solution

  • Documentation on the logistic regression model in statsmodels may be found here, for the latest development version. All models follow a familiar series of steps, so this should provide sufficient information to implement it in practice (do make sure to have a look at some examples, e.g. here). In general, I would not suggest you go about re-implementing solvers/models already made available in scipy or statsmodels, unless you have a very specific need.

    Now, I've used your script to generate some data and the Logit model to estimate the parameters, as follows,

    import statsmodels.api as sm
    
    X, y = logit_data(n[-1], p, theta)
    
    model = sm.Logit(y, X)
    result = model.fit()
    
    print(result.summary())
    

    This outputs (your mileage may vary),

    Optimization terminated successfully.
             Current function value: 0.203609
             Iterations 9
                               Logit Regression Results                           
    ==============================================================================
    Dep. Variable:                      y   No. Observations:                   89
    Model:                          Logit   Df Residuals:                       83
    Method:                           MLE   Df Model:                            5
    Date:                Wed, 17 Jul 2019   Pseudo R-squ.:                  0.7062
    Time:                        13:40:37   Log-Likelihood:                -18.121
    converged:                       True   LL-Null:                       -61.684
                                            LLR p-value:                 2.695e-17
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const          1.0735      0.540      1.986      0.047       0.014       2.133
    x1             2.0890      0.594      3.518      0.000       0.925       3.253
    x2             1.7191      0.459      3.746      0.000       0.820       2.618
    x3             1.7228      0.464      3.713      0.000       0.813       2.632
    x4             1.1897      0.410      2.902      0.004       0.386       1.993
    x5             2.2008      0.653      3.370      0.001       0.921       3.481
    ==============================================================================
    
    Possibly complete quasi-separation: A fraction 0.10 of observations can be
    perfectly predicted. This might indicate that there is complete
    quasi-separation. In this case some parameters will not be identified.
    

    These coefficients may be accessed via result.params, as follows,

    >>> result.params
    [1.0734945  2.08898192 1.71907914 1.72278748 1.18972079 2.20079805]
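
    On question 2, the same statsmodels call should scale to O(100) covariates without changes, provided the number of samples is comfortably larger than p and the data are not (quasi-)separated (as the warning above hints, large true coefficients make separation more likely). A rough sketch with hypothetical sizes, reusing your logit_data:

    # Hypothetical scalability check: p = 100 covariates, n = 10**4 samples;
    # small true coefficients keep the classes from being perfectly separated
    p_big = 100
    theta_big = np.random.normal(0, 0.1, p_big + 1)
    X_big, y_big = logit_data(10**4, p_big, theta_big)
    result_big = sm.Logit(y_big, X_big).fit(disp=0)  # disp=0 silences the convergence printout
    print(result_big.params.shape)                   # (101,) estimated coefficients

    As for the scikit-learn snippet in your question: get_params() only returns the estimator's hyperparameters, not the fit. The estimated coefficients live in the coef_ and intercept_ attributes after calling fit. Note also that LogisticRegression regularises by default, so for a (near-)unpenalised maximum likelihood fit you would set C to a large value; and since your X already contains a column of ones, the built-in intercept can be switched off:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(fit_intercept=False, C=1e6, solver='lbfgs').fit(X, y.ravel())
    thetaEst = clf.coef_.ravel()   # same ordering as the columns of X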