Tags: python, global, distribution, data-fitting, symfit

Global distribution fitting with shared parameters, without specifying the bin size, in Python


I have several data sets that each fit a von Mises distribution very well on their own. I am looking for a way to fit all of them jointly, sharing mu but with a different kappa for each, without having to worry about the choice of bins.

When fitting only one model this is quite trivial: scipy here does the fit on the raw data. But everything I have found on global fitting with symfit or lmfit, including some posts (here and here), requires specifying x-coordinates and y-coordinates, which means we first have to choose some bin size for the distribution.
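For reference, this is the kind of binned workflow I am trying to avoid (a minimal sketch; the 40 bins are an arbitrary choice and vonmises_pdf is just an illustrative helper):

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import i0

y = np.random.vonmises(0.05, 3.1, 8000)

def vonmises_pdf(x, mu, kappa):
    # von Mises density: exp(kappa*cos(x - mu)) / (2*pi*I0(kappa))
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * i0(kappa))

# The fitted parameters now depend on this arbitrary bin count.
counts, edges = np.histogram(y, bins=40, density=True)
centers = (edges[:-1] + edges[1:]) / 2
(mu_hat, kappa_hat), _ = curve_fit(vonmises_pdf, centers, counts, p0=[0.0, 1.0])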

Here is some artificial data for just two data sets that could serve as an example of what I need, though each set is fitted individually, using scipy. (Note that I don't need to worry about the choice of bins.)

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# creating the data
mu1, mu2 = .05, -.05
sigma1, sigma2 = 3.1, 2.9
n1, n2 = 8000, 9000
y1 = np.random.vonmises(mu1, sigma1, n1)
y2 = np.random.vonmises(mu2, sigma2, n2)

# fitting
dist = st.vonmises
*args1, loc1, scale1 = dist.fit(y1, fscale=1)
*args2, loc2, scale2 = dist.fit(y2, fscale=1)

x1 = np.linspace(np.min(y1), np.max(y1), 200)
x2 = np.linspace(np.min(y2), np.max(y2), 200)

pdf_fitted1 = dist.pdf(x1, *args1, loc=loc1, scale=scale1)
pdf_fitted2 = dist.pdf(x2, *args2, loc=loc2, scale=scale2)

# plotting
plt.hist(y1, bins=40, density=True, histtype='step', color='#1f77b4')
plt.hist(y2, bins=40, density=True, histtype='step', color='#ff7f0e')
plt.plot(x1, pdf_fitted1, color='#1f77b4')
plt.plot(x2, pdf_fitted2, color='#ff7f0e')
plt.show()

I'd be glad if someone could help with this; any answer or comment would be appreciated. Thanks in advance.


Solution

  • Thank you for this excellent question. In principle you should be able to solve this using symfit out of the box, but your question brought to light some minor issues. symfit is very much a project in flux, so unfortunately this will happen from time to time. But I created a workaround that should work on the current master branch, and I hope to release a new version soon which will address these issues.

In principle this is a combination of the global fitting examples you already found with the LogLikelihood objective function. With LogLikelihood, you don't need to bin but instead use the measurements directly. However, LogLikelihood does not seem to handle multi-component models properly yet, so I included a fixed version of LogLikelihood.
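    To see the principle in plain scipy first: an unbinned objective is just the negative sum of the log-densities of the raw samples, with mu shared across the datasets. Here is a minimal sketch of that idea (joint_nll, data1, and data2 are only illustrative names, not part of symfit):

    import numpy as np
    from scipy.stats import vonmises
    from scipy.optimize import minimize

    data1 = np.random.vonmises(0.05, 3.5, 8000)
    data2 = np.random.vonmises(-0.05, 2.5, 9000)

    def joint_nll(params):
        # Shared mu, separate kappas; every raw sample enters the
        # objective directly, so no histogram or bin choice is involved.
        mu, k1, k2 = params
        return (- np.sum(vonmises.logpdf(data1, k1, loc=mu))
                - np.sum(vonmises.logpdf(data2, k2, loc=mu)))

    res = minimize(joint_nll, x0=[0.0, 1.0, 1.0],
                   bounds=[(-np.pi, np.pi), (1e-3, None), (1e-3, None)])
    print(res.x)  # [mu, kappa_1, kappa_2]

    With that intuition, the full symfit solution looks like this: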

    import numpy as np
    import matplotlib.pyplot as plt
    from symfit import (
        Fit, parameters, variables, exp, cos, CallableModel, pi, besseli
    )
    from symfit.core.objectives import LogLikelihood
    from symfit.core.printing import SymfitNumPyPrinter
    
    # The symbolic besseli is not converted into a numerical function yet;
    # this monkey-patches the printer to use scipy.special.iv.
    def _print_besseli(self, expr):
        return 'scipy.special.iv({}, {})'.format(*expr.args)
    
    SymfitNumPyPrinter._print_besseli = _print_besseli
    
    # creating the data
    mu1, mu2 = .05, -.05  # Are these supposed to be opposite sign?
    sigma1, sigma2 = 3.5, 2.5
    n1, n2 = 8000, 9000
    np.random.seed(42)
    x1 = np.random.vonmises(mu1, sigma1, n1)
    x2 = np.random.vonmises(mu2, sigma2, n2)
    
    # Create a model for `n` different datasets.
    n = 2
    x, *xs = variables('x,' + ','.join('x_{}'.format(i) for i in range(1, n + 1)))
    ys = variables(','.join('y_{}'.format(i) for i in range(1, n + 1)))
    mu, kappa = parameters('mu, kappa')
    kappas = parameters(','.join('k_{}'.format(i) for i in range(1, n + 1)),
                        min=0, max=10)
    mu.min, mu.max = - np.pi, np.pi  # Bound mu to the interval [-pi, pi]
    
    # Create a model template, whose symbols we will replace for each component.
    template = exp(kappa * cos(x - mu)) / (2 * pi * besseli(0, kappa))
    model = CallableModel(
        {y_i: template.subs({kappa: k_i, x: x_i}) for y_i, x_i, k_i in zip(ys, xs, kappas)}
    )
    print(model)
    
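    # Workaround: the stock LogLikelihood does not yet reduce multi-component
    # models correctly. super(LogLikelihood, self).__call__ deliberately skips
    # LogLikelihood's own __call__ and returns the raw evaluated model
    # components, which are then summed into a single scalar -log(L) below.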
    class AlfredosLogLikelihood(LogLikelihood):
        def __call__(self, *args, **kwargs):
            evaluated_func = super(LogLikelihood, self).__call__(
                *args, **kwargs
            )
    
            ans = - sum([np.nansum(np.log(component))
                         for component in evaluated_func])
            return ans
    
    fit = Fit(model, x_1=x1, x_2=x2, objective=AlfredosLogLikelihood)
    
    x_axis = np.linspace(- np.pi, np.pi, 101)
    fit_result = fit.execute()
    print(fit_result)
    x1_result, x2_result = model(x_1=x_axis, x_2=x_axis, **fit_result.params)
    
    # plotting
    plt.hist(x1, bins=40, density=True, histtype='step', color='#1f77b4')
    plt.hist(x2, bins=40, density=True, histtype='step', color='#ff7f0e')
    plt.plot(x_axis, x1_result, color='#1f77b4')
    plt.plot(x_axis, x2_result, color='#ff7f0e')
    plt.show()
    

    This outputs the following:

    [y_1(x_1; k_1, mu) = exp(k_1*cos(mu - x_1))/(2*pi*besseli(0, k_1)),
     y_2(x_2; k_2, mu) = exp(k_2*cos(mu - x_2))/(2*pi*besseli(0, k_2))]
    
    Parameter Value        Standard Deviation
    k_1       3.431673e+00 None
    k_2       2.475649e+00 None
    mu        1.030791e-02 None
    Status message         b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
    Number of iterations   13
    

    [Plot: histograms of both datasets with the jointly fitted von Mises curves overlaid.]
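
    As a quick sanity check, one can compare the shared-mu result against independent scipy fits of each dataset (a small sketch reusing x1 and x2 from above):

    import scipy.stats as st

    k1_hat, loc1_hat, _ = st.vonmises.fit(x1, fscale=1)
    k2_hat, loc2_hat, _ = st.vonmises.fit(x2, fscale=1)
    # The individual kappas should land close to k_1 and k_2 above,
    # and the two locs should straddle the shared mu.
    print(k1_hat, loc1_hat, k2_hat, loc2_hat)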

    I hope this gets you in the right direction, and thank you for bringing this bug to light ;).