I have calculated loads for bridges and I want to fit a Gumbel distribution to the highest 20% of them using a maximum likelihood estimate. I need help calculating the parameters for the distribution. I have read through the scipy.optimize documentation but I can't understand how to apply its functions to estimate a two-parameter function.
Here is a bit of theory that might help: there are two likelihood functions (L1 and L2), one for values above some threshold (x >= C) and one for values below (x < C). The likeliest parameters are those that maximize the product of the two functions, max(L1*L2). Here L1 is still the product of the probability density function's values at the xi, but L2 is the probability that the threshold value C will be exceeded (1 - F(C)).
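In other words, the quantity to maximize over (loc, scale) is L1 * L2 = [product over i of f(xi; loc, scale)] * [1 - F(C; loc, scale)], where f and F are the Gumbel pdf and cdf and the xi are the known (non-truncated) values.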
Here's some code I have written:
non_truncated_data = ([15.999737471905252, 16.105716234887431, 17.947809230275304, 16.147752064149291, 15.991427126788327, 16.687542227378565, 17.125139229445359, 19.39645340792385, 16.837044960487795, 15.804473320190725, 16.018569387471025, 16.600876724289019, 16.161306985203151, 17.338636901595873, 18.477371969176406, 17.897236722220281, 16.626465201654593, 16.196548622931672, 16.013794215070927, 16.30367884232831, 17.182106070966608, 18.984566931768452, 16.885737663740024, 16.088051117522948, 15.790480003140173, 18.160947973898388, 18.318158853376037])
threshold = 15.78581825859324
import math

def maximum_likelihood_function(non_truncated_loads, threshold, loc, scale):
    """Calculates the maximum likelihood function's value for the given
    truncated data with the given parameters.

    The maximum likelihood function for truncated data is L1 * L2, where L1 is
    the product of the pdf values at the non-truncated known values
    (non_truncated_loads) and L2 is the probability that the threshold value
    will be exceeded.
    """
    # calculates L1 as the product of the pdf values
    L1 = 1.0
    for x in non_truncated_loads:
        L1 *= gumbel_pdf(x, loc, scale)
    # calculates L2, the probability that the threshold is exceeded
    cdf_at_threshold = gumbel_cdf(threshold, loc, scale)
    L2 = 1 - cdf_at_threshold
    return L1 * L2

def gumbel_pdf(x, loc, scale):
    """Returns the value of the Gumbel pdf with parameters loc and scale at x."""
    # substitute
    z = (x - loc) / scale
    return (1.0 / scale) * math.exp(-(z + math.exp(-z)))

def gumbel_cdf(x, loc, scale):
    """Returns the value of the Gumbel cdf with parameters loc and scale at x."""
    return math.exp(-math.exp(-(x - loc) / scale))
First of all, the easiest way to optimize a function using scipy.optimize is to construct the target function so that its first argument is a list of the parameters to be optimized, and the following arguments specify other things, such as the data and fixed parameters.
Second, it will be very helpful to use the vectorization provided by numpy.
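As a minimal sketch of that calling convention (the objective and data here are toy placeholders, not part of the actual fit):

import numpy as np
import scipy.optimize as so

def objective(p, data):
    """First argument p is the list of parameters being optimized; data is passed via args."""
    loc, scale = p
    # toy objective: minimized at loc = mean(data), scale = std(data)
    return (loc - data.mean()) ** 2 + (scale - data.std()) ** 2

toy_data = np.array([1.0, 2.0, 3.0, 4.0])
so.fmin(objective, [0.5, 0.5], args=(toy_data,))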
Therefore we have these:
In [61]:
# modified pdf and cdf
import numpy as np

def gumbel_pdf(x, loc, scale):
    """Returns the value of the Gumbel pdf with parameters loc and scale at x."""
    # substitute
    z = (x - loc) / scale
    return (1. / scale) * (np.exp(-(z + (np.exp(-z)))))

def gumbel_cdf(x, loc, scale):
    """Returns the value of the Gumbel cdf with parameters loc and scale at x."""
    return np.exp(-np.exp(-(x - loc) / scale))
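As a quick optional sanity check (not part of the original answer), these vectorized functions can be compared against scipy.stats.gumbel_r, which implements the same distribution:

from scipy.stats import gumbel_r

xs = np.array([15.8, 16.5, 18.0])
print(gumbel_pdf(xs, 16.5, 0.7) - gumbel_r.pdf(xs, loc=16.5, scale=0.7))  # ~0
print(gumbel_cdf(xs, 16.5, 0.7) - gumbel_r.cdf(xs, loc=16.5, scale=0.7))  # ~0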
In [62]:
def trunc_GBL(p, x):
    threshold = p[0]
    loc = p[1]
    scale = p[2]
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log((gumbel_pdf(x1, loc, scale) / scale))).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    # print(x1, nx2, L1, L2)
    return L1 + L2
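Note that trunc_GBL returns the negative log-likelihood rather than L1*L2 itself: taking logs turns the product of many small pdf values into a sum (avoiding numerical underflow), and minimizing the negative log-likelihood is equivalent to maximizing the likelihood. Purely as an illustration (this call is not part of the original session), evaluating the objective at a rough starting guess on the data from the question looks like:

X = np.array(non_truncated_data)
print(trunc_GBL([17, 16.5, 0.7], X))  # a finite number; smaller means a better fit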
In [63]:
import scipy.optimize as so
In [64]:
#first we make a simple Gumbel fit
so.fmin(lambda p, x: (-np.log(gumbel_pdf(x, p[0], p[1]))).sum(), [0.5,0.5], args=(np.array(non_truncated_data),))
Optimization terminated successfully.
Current function value: 35.401255
Iterations: 70
Function evaluations: 133
Out[64]:
array([ 16.47028986, 0.72449091])
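As an optional cross-check (not shown in the original session), this plain untruncated fit can be compared with scipy.stats.gumbel_r.fit, which performs the same maximum likelihood estimation and should return nearly the same loc and scale:

from scipy.stats import gumbel_r

loc_hat, scale_hat = gumbel_r.fit(np.array(non_truncated_data))
print(loc_hat, scale_hat)  # expected to be close to 16.47 and 0.72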
In [65]:
#then we use the result as starting value for your truncated Gumbel fit
so.fmin(trunc_GBL, [17, 16.47028986, 0.72449091], args=(np.array(non_truncated_data),))
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 25
Function evaluations: 94
Out[65]:
array([ 13.41111111, 16.65329308, 0.79694 ])
Here in the trunc_GBL function I substituted your pdf with a scaled pdf. See the rationale here; basically it is because your L1 is pdf-based while your L2 is cdf-based: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Then we notice a problem: see Current function value: 0.000000 in the last output. The negative log-likelihood function is 0.
This is because:
In [66]:
gumbel_cdf(13.41111111, 16.47028986, 0.72449091)
Out[66]:
2.3923515777163676e-30
Effectively 0. This means that, based on the model you just described, the maximum is always reached when the threshold value is low enough that L1 does not exist (x < threshold is empty) and L2 is 1 (1 - F(C) is 1 for all items in your data).
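A small illustrative check makes this concrete: with the fitted threshold pushed below all of the data, the L1 term sums over an empty array and the L2 term is the log of a number that is essentially 1 (the values below are taken from the fitted output above):

X = np.array(non_truncated_data)
thr, loc, scale = 13.41111111, 16.65329308, 0.79694
print(X[X < thr])                                # empty array, so the L1 sum is 0.0
print(-np.log(1 - gumbel_cdf(thr, loc, scale)))  # ~0.0, since 1 - F(C) is essentially 1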
For this reason, your model does not look quite right to me. You may want to rethink it.
We can further isolate threshold and treat it as a fixed parameter:
def trunc_GBL(p, x, threshold):
    loc = p[0]
    scale = p[1]
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log((gumbel_pdf(x1, loc, scale) / scale))).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    # print(x1, nx2, L1, L2)
    return L1 + L2
And call the optimizer differently:
X = np.array(non_truncated_data)
so.fmin(trunc_GBL, [0.5, 0.5], args=(X, np.percentile(X, 20)))
Optimization terminated successfully.
Current function value: 20.412818
Iterations: 72
Function evaluations: 136
Out[9]:
array([ 16.34594943, 0.45253201])
This way, if you want the 70% quantile, you can simply change it to np.percentile(X, 30), and so on. np.percentile(X, 80) is just another way of doing .quantile(0.8).
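For example (assuming pandas is available, which the original answer does not use), the two forms give the same threshold value:

import numpy as np
import pandas as pd

X = np.array(non_truncated_data)
print(np.percentile(X, 80))        # 80th percentile via numpy
print(pd.Series(X).quantile(0.8))  # same value via pandas .quantile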