Search code examples
numpyscipycurve-fittingnormal-distributionscipy-optimize

How to fit a Normal Distribution where μ is the function p(d)?


I have defined the following Normal Distribution N. Here, r is the random variable (you can think of r as "age"), while the mean of N is given by the function P(d) which (as a parameter) is fixing N every time (you can think of d as "height"):

def p(d, a, b):
    return a-b*d

def N(r, d, a, b, s):
    return (1/(s*sqrt(2*pi)))*exp(-(1/2)*((r-p(d, a, b))/s)**2)

In other words, for different values of d (height), N becomes a different PDF (shaped by a, b and s) that describes the random variable r (age).

I have many (18 million) d, r pairs and I would like to fit the PDF on these data, finding the optimal a, b and s.

How can I do that?


Solution

  • So you want to find the parameters a, b and s that maximizes the likelihood of the data? So I can assume your loss function will be the product of N(r, d, a, b, s) given the r and d on your data. There are many optimization methods, given that these functions are differentiable, you can even use an autograd framework like Tensorflow or PyTorch. But for simplicity I will use scipy as you tagged it, it should be fine if your data is small (<1000).

    import numpy as np
    import scipy.optimize
    from numpy import pi, sqrt, exp, log
    
    def p(d, a, b):
        return a-b*d
    
    def N(r, d, a, b, s): # Writen as numpy-friendly (accepts numpy arrays as inputs)
        return (1/(s*sqrt(2*pi)))*exp(-(1/2)*((r-p(d, a, b))/s)**2)
    
    def minus_log_likelihood(p): # params, a, b, s. Log sum is equivalent to product
        return -np.sum(log(N(dataset[:, 0], dataset[:, 1], p[0], p[1], p[2])))
    
    dataset = np.random.uniform(size=(100, 2)) # 100 points with d and r values
    res = scipy.optimize.minimize(minus_log_likelihood, [0, 0, 1])