python-3.x, scipy, arguments, parameter-passing, minimization

How can I pass arguments through a chain of nested functions to calculate a result?


My question is quick, but I have provided a hefty piece of code to better illustrate my problem, since reading related posts has not cleared it up for me.

The code below picks optimized parameters that are packed into an args list; the args should be a single entry, like x0 in the scipy docs. I am hoping to find the combination of args that fits the data best. The scipy optimize routines are supposed to vary the values of my args to find the combination that minimizes my error the most, but I am having trouble passing args from one function to another.

Sometimes I put a * or **, but my success rate is more miss than hit. I want to know how to pass args from one function to another while allowing them to change value, so that their optimized values can be found. (The optimized values minimize the error, as explained below.) I have a few functions that serve as inputs to other functions, and I am missing a key concept here. Are kwargs necessary for something like this? If the args are a tuple, can they still change value to find the optimized parameters? I'm aware that somewhat similar questions have been asked here on SO, but I haven't been able to figure it out with those resources yet.
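For reference, the general pattern from the scipy docs that I am trying to follow: the objective function takes a single array of free parameters, and anything fixed is forwarded untouched through the optimizer's args tuple. A minimal runnable sketch of just that pattern (illustrative names, not my code below):

import numpy as np
from scipy.optimize import minimize

def objective( p , xdata , ydata ):
    ## p is ONE array holding all free parameters
    mu , sigma = p
    model = np.exp( -(xdata - mu)**2 / (2 * sigma**2) )
    return np.sum( (ydata - model)**2 )

xdata = np.linspace( 0 , 10 , 50 )
ydata = np.exp( -(xdata - 5.0)**2 / 2 ) ## fake data with mu = 5, sigma = 1
res = minimize( objective , x0 = [4.0, 2.0] , args = (xdata, ydata) )
print( res.x ) ## approximately [5.0, 1.0]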

The code is explained below (after imports).

import numpy as np
import random
import matplotlib.pyplot as plt
from math import exp
from math import log
from math import pi
from scipy.integrate import quad ## integrate f(x)dx from x_i to x_i+1
from scipy.stats import norm
from scipy.stats import chisquare
from scipy.optimize import basinhopping
from scipy.stats import binned_statistic as bstat

I generated a random Gaussian distribution sample of 1000 data points, with the average mu = 48 and the standard deviation sigma = 7. I can histogram the data, and my goal is to find the parameters mu, sigma, and normc (scaling factor or normalization constant) that produce the best fit to a histogram of the sample data. There are many error analysis methods but for my purpose, the best fit is determined as the fit that minimizes Chi-Square (described a little further below). I know the code is long (too long even), but my question requires a bit of setup.

## generate data sample
a, b = 48, 7 ## mu, sigma
randg = []
for index in range( 1000 ):
    randg.append( random.gauss(a,b) )
data = sorted( randg )

small = min( data )
big = max( data )
domain = np.linspace(small,big,3000) ## for fitted plot overlay on histogram of data

I then organized my bins for the histogram.

numbins = 30 ## number of bins

def binbounder( small , big , numbins ):
    ## generates list of bin edges for histogram ++ bincount
    binwide = ( big - small ) / numbins ## binwidth
    binbound = [] ## bin edges
    for index in range( numbins ):
        binbound.append( small + index * binwide ) ## left edges
    binbound.append( big ) ## rightmost edge completes all bin edges
    return binbound

binborders = binbounder( small , big , numbins )
## useful if one performs plt.hist(data, bins = binborders, ...)

def binmidder( small , big , numbins ):
    ## all midpoints of bins
    ## for x-ticks on histogram
    ## useful to visualize over/under -estimate of error
    binbound = binbounder( small , big , numbins )
    binwide = ( big - small ) / numbins
    binmiddles = []
    for index in range( len( binbound ) - 1 ):
        binmiddles.append( small + binwide/2 + index * binwide ) ## offset by leftmost edge
    return binmiddles

binmids = binmidder( small , big , numbins )

To perform the Chi-Square analysis, one must input the expectation value per bin (E_i) and the multiplicity of observed values per bin (O_i), and output the sum over all bins of the square of their difference divided by the expectation value per bin.
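In symbols: chisq = sum over all bins i of (O_i - E_i)^2 / E_i.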

def countsperbin( xdata , args = [ small , big , numbins ]):
    ## calculates multiplicity of observed values per bin
    binborders = binbounder( small , big , numbins )
    values = sorted( xdata ) ## function(xdata) ~ f(x)
    bincount = []
    for jndex in range( len( binborders ) - 1 ): ## one count per bin, not per edge
        summ = 0
        for val in values:
            if val > binborders[ jndex ] and val <= binborders[ jndex + 1 ]:
                summ += 1
        bincount.append( summ )
    return bincount

obsperbin = countsperbin( data ) ## multiplicity of observed values per bin

Each expectation value per bin, which is needed to calculate and minimize Chi Squared, is defined as the integral of the distribution function from x_i = left binedge to x_i+1 = right binedge.
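The mechanics of this in scipy: quad forwards its args tuple to the integrand as extra positional arguments after x. A standalone sketch of just that pattern (illustrative names, not my fitting code):

from scipy.integrate import quad

def integrand( x , mu , sigma ):
    ## for args = (mu, sigma), quad evaluates integrand(x, mu, sigma)
    return ( x - mu ) / sigma

area, abserr = quad( integrand , 0.0 , 1.0 , args = (0.5, 2.0) )
print( area ) ## integral of (x - 0.5)/2 from 0 to 1 is 0.0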

I want a reasonable initial guess for my optimized parameters, as these will give me a reasonable guess for a minimized Chi Squared. I choose mu, sigma, and normc to be close to but not equal to their true values so that I can test if the minimization worked.

def maxbin( perbin ):
    ## perbin is a list of observed data per bin
    ## returns largest multiplicity of observed values with index
    ## useful to help guess scaling factor "normc" (outside exponential in GaussDistrib)
    for index, maxval in enumerate( perbin ):
        if maxval == max( perbin ):
            optindex = index
    return optindex, perbin[ optindex ] 

mu, sigma, normc = np.mean( data ) + 30, np.std( data ) + 20, maxbin( obsperbin )[1] ## [1] = peak bin count

Since we are integrating f(x)dx, the data points (or xdata) are irrelevant here.

def GaussDistrib( xdata , args = [ mu , sigma , normc ] ): ## G(x)
    return normc * exp( (-1) * (xdata - mu)**2 / (2 * sigma**2) )

def expectperbin( args ):
    ## calculates expectation values per bin
    ## needed with observation values per bin for ChiSquared
    ## expectation value of single bin is equal to area under Gaussian curve from left binedge to right binedge
    ## area under curve for ith bin = integral G(x)dx from x_i (left edge) to x_i+1 (right edge)
    ans = []
    for index in range(len(binborders)-1): # ith index does not exist for rightmost boundary
        ans.append( quad( GaussDistrib , binborders[ index ] , binborders[ index + 1 ], args = [ mu , sigma , normc ])[0])
    return ans

My defined function chisq calls chisquare from the scipy module to return a result.

def chisq( args ):
    ## args[0] = mu
    ## args[1] = sigma
    ## args[2] = normc
    ## first element [0] gives the chi-squared value, [1] gives the p-value (0 ≤ p-value ≤ 1)
    ## can also minimize the negative p-value to find the best fitting chi square
    return chisquare( obsperbin , expectperbin( args[0] , args[1] , args[2] ))[0]

I do not know how, but I would like to place constraints on my system. Specifically, the maximum of the list of heights of the binned data must be greater than zero (as must Chi-Square, due to the exponential term that remains after differentiating).
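From the basinhopping docs, bounds of this kind can be forwarded to the local minimizer through minimizer_kwargs. A self-contained sketch with a stand-in objective (not my fitting code), constraining sigma and normc to stay positive:

from scipy.optimize import basinhopping

def sphere( p ):
    ## stand-in objective so the sketch runs on its own
    return sum( v**2 for v in p )

minimizer_kwargs = {
    "method" : "L-BFGS-B",
    "bounds" : [ (None, None),  ## mu unconstrained
                 (1e-9, None),  ## sigma > 0
                 (1e-9, None) ] ## normc > 0
}
res = basinhopping( sphere , [1.0, 1.0, 1.0] , niter = 50 ,
                    minimizer_kwargs = minimizer_kwargs )
print( res.x )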

def miniz( chisq , chisqguess , niter = 200 ):
    minimizer = basinhopping( chisq , chisqguess , niter = niter )
    ## Minimization methods available via https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html
    return minimizer

expperbin = expectperbin( args = [mu , sigma , normc] )
# chisqmin = chisquare( obsperbin , expperbin )[0]
# chisqmin = result.fun

""" OPTIMIZATION """

print("")
print("initial guess of optimal parameters")

initial_mu, initial_sigma, initial_normc = np.mean(data)+30 , np.std(data)+20 , maxbin( obsperbin )[1]
## check optimized result against:  mu = 48, sigma = 7 (via random number generator for Gaussian Distribution)

chisqguess = chisquare( obsperbin , expectperbin( args[0] , args[1] , args[2] ))[0]
## initial guess for optimization

result = miniz( chisqguess, args = [mu, sigma, normc] )
print(result)
print("")

The point of the minimization was to find the optimized parameters that give the best fit.

optmu , optsigma , optnormc = result.x[0], abs(result.x[1]), result.x[2]

chisqcheck = chisquare(obsperbin, expperbin)
chisqmin = result.fun
print("chisqmin --  ",chisqmin,"        ",chisqcheck," --   check chi sq")

print("""
""")

## CHECK
checkbins = bstat(data, data, statistic = 'sum', bins = binborders) ## via SCIPY (imports)
binsum = checkbins[0]
binedge = checkbins[1]
binborderindex = checkbins[2]
print("binsum",binsum)
print("")
print("binedge",binedge)
print("")
print("binborderindex",binborderindex)
# Am I doing this part right?

tl;dr: I want result, which calls the function miniz, which calls a scipy routine to minimize Chi Squared using a guess value. Chi Squared and the guess value each call other functions, etc. How can I pass my args through correctly?


Solution

  • You can access all information from the OptimizeResult that is returned from optimize.basinhopping.

    I've abstracted away the generation of the random sample and reduced the number of your functions to those 5 functions that are really needed to run the optimization.

    The only "tricky" part in parameter passing is to pass the parameters mu and sigma to the GaussDistrib function inside the quad call, but that is readily explained in the quad doc. Other than that, I fail to see a real problem with parameter passing here.

    Your use of normc throughout is misguided. You don't get correct values from the Gaussian that way (and there is no need to vary 3 independent parameters when 2 are sufficient). Also, to get the correct values for chi square, you have to multiply the probabilities from the Gaussian by the sample count (you are comparing absolute counts from the obsperbin bins with probabilities under the Gaussian, which is clearly wrong).
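    In symbols: for bin i with edges x_i and x_i+1, the expected count is E_i = N * (integral from x_i to x_i+1 of pdf(x; mu, sigma) dx), where N = 999 is the sample size; chisquare then compares these E_i with the observed counts O_i.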

    from math import exp
    from math import pi
    from scipy.integrate import quad
    from scipy.stats import chisquare
    from scipy.optimize import basinhopping
    
    
    # smallest value in the sample
    small = 26.55312337811099
    # largest value in the sample
    big   = 69.02965763016027
    
    # a random sample from N(48, 7) with 999 sample
    # values binned into 30 equidistant bins ranging
    # from 'small' (bin[0] lower bound) to 'big'
    # (bin[29] upper bound) 
    obsperbin = [ 1,  1,  2,  4,  8, 10, 13, 29, 35, 45,
                 51, 56, 63, 64, 96, 89, 68, 80, 61, 51,
                 49, 30, 34, 19, 22,  3,  7,  5,  1,  2]
    
    numbins = len(obsperbin) #  30
    numobs  = sum(obsperbin) # 999
    
    # intentionally wrong guesses of mu and sigma
    # to be provided as optimizer's initial values
    initial_mu, initial_sigma = 78.5, 27.0
    
    
    def binbounder( small , big , numbins ):
        ## generates list of bin edges for histogram ++ bincount
        binwide = ( big - small ) / numbins ## binwidth
        binbound = [] ## bin edges
        for index in range( numbins ):
            binbound.append( small + index * binwide ) ## left edges
        binbound.append( big ) ## rightmost edge completes all bin edges
        return binbound
    
    # setup the bin borders
    binborders = binbounder( small , big , numbins )
    
    
    def GaussDistrib( x , mu , sigma ):
        return 1/(sigma * (2*pi)**(1/2)) * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )
    
    
    def expectperbin( musigma ):
        ## musigma[0] = mu
        ## musigma[1] = sigma
        ## calculates expectation values per bin
        ## expectation value of single bin is equal to area under Gaussian
        ## from left binedge to right binedge multiplied by the sample size
        e = []
        for i in range(len(binborders)-1): # ith index does not exist for rightmost boundary
            e.append( quad( GaussDistrib , binborders[ i ] , binborders[ i + 1 ],
                             args = ( musigma[0] , musigma[1] ))[0] * numobs)
        return e
    
    
    def chisq( musigma ):
        ## first subscript [0] gives chi-squared value, [1] gives 0 ≤ p-value ≤ 1
        return chisquare( obsperbin , expectperbin( musigma ))[0]
    
    
    def miniz( chisq , musigma ):
        return basinhopping( chisq , musigma , niter = 200 )
    
    
    ## chisquare value for initial parameter guess
    chisqguess = chisquare( obsperbin , expectperbin( [initial_mu , initial_sigma] ))[0]
    
    res = miniz( chisq, [initial_mu , initial_sigma] )
    
    print("chisquare from initial guess:" , chisqguess)
    print("chisquare after optimization:" , res.fun)
    print("mu, sigma after optimization:" , res.x[0], ",", res.x[1])
    

    chisquare from initial guess: 3772.70822797

    chisquare after optimization: 26.351284911784447

    mu, sigma after optimization: 48.2701027439, 7.046156286

    Btw, basinhopping is overkill for this kind of problem. I'd stay with fmin (Nelder-Mead).
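    For completeness, the same local search without basinhopping could look like the sketch below (reusing chisq, initial_mu, and initial_sigma as defined above; untuned defaults, so treat it as an illustration):

    from scipy.optimize import fmin

    # a single Nelder-Mead local search instead of 200 basinhopping iterations
    opt_mu , opt_sigma = fmin( chisq , [initial_mu , initial_sigma] )
    print("mu, sigma via fmin:", opt_mu, ",", opt_sigma)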