Tags: python, global, distribution, data-fitting, symfit

Global distribution fitting with shared parameters, without specifying the bin size, in Python


I have several data sets that each fit a von Mises distribution very well on their own. I am looking for a way to fit all of them jointly, sharing mu but with a different kappa for each, without having to worry about the choice of bins.

When fitting only one model this is quite trivial: scipy here does the fit on the raw data. But everything I have found on global fitting with symfit or lmfit, including some posts (here and here), requires specifying x-coordinates and y-coordinates, which means we first have to choose some bin size for the distribution.
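For reference, this is the kind of binned workflow I am trying to avoid (a minimal sketch; the 40 bins are an arbitrary choice and vonmises_pdf is just an illustrative helper):

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import i0

y = np.random.vonmises(0.05, 3.1, 8000)

def vonmises_pdf(x, mu, kappa):
    # von Mises density: exp(kappa*cos(x - mu)) / (2*pi*I0(kappa))
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * i0(kappa))

# The fitted parameters now depend on this arbitrary bin count.
counts, edges = np.histogram(y, bins=40, density=True)
centers = (edges[:-1] + edges[1:]) / 2
(mu_hat, kappa_hat), _ = curve_fit(vonmises_pdf, centers, counts, p0=[0.0, 1.0])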

Here is some artificial data for just two data sets that could serve as an example of what I need, though each set is fitted individually, using scipy. (Note that I don't need to worry about the choice of bins.)

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# creating the data
mu1, mu2 = .05, -.05
sigma1, sigma2 = 3.1, 2.9
n1, n2 = 8000, 9000
y1 = np.random.vonmises(mu1, sigma1, n1)
y2 = np.random.vonmises(mu2, sigma2, n2)

# fitting
dist = st.vonmises
*args1, loc1, scale1 = dist.fit(y1, fscale=1)
*args2, loc2, scale2 = dist.fit(y2, fscale=1)

x1 = np.linspace(np.min(y1), np.max(y1), 200)
x2 = np.linspace(np.min(y2), np.max(y2), 200)

pdf_fitted1 = dist.pdf(x1, *args1, loc=loc1, scale=scale1)
pdf_fitted2 = dist.pdf(x2, *args2, loc=loc2, scale=scale2)

# plotting
plt.hist(y1, bins=40, density=True, histtype='step', color='#1f77b4')
plt.hist(y2, bins=40, density=True, histtype='step', color='#ff7f0e')
plt.plot(x1, pdf_fitted1, color='#1f77b4')
plt.plot(x2, pdf_fitted2, color='#ff7f0e')
plt.show()

I'd be glad if someone could help with this; any answer or comment would be appreciated. Thanks in advance.


Solution

  • Thank you for this excellent question. In principle you should be able to solve this using symfit out of the box, but your question brought to light some minor issues. symfit is very much a project in flux, so unfortunately this will happen from time to time. But I created a workaround that should work on the current master branch, and I hope to release a new version soon which will address these issues.

In principle this is a combination of the global fitting examples you already found with the LogLikelihood objective function. With LogLikelihood, you don't need to bin but instead use the measurements directly. However, LogLikelihood does not seem to handle multi-component models properly yet, so I included a fixed version of LogLikelihood.
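    To see the principle in plain scipy first: an unbinned objective is just the negative sum of the log-densities of the raw samples, with mu shared across the datasets. Here is a minimal sketch of that idea (joint_nll, data1, and data2 are only illustrative names, not part of symfit):

    import numpy as np
    from scipy.stats import vonmises
    from scipy.optimize import minimize

    data1 = np.random.vonmises(0.05, 3.5, 8000)
    data2 = np.random.vonmises(-0.05, 2.5, 9000)

    def joint_nll(params):
        # Shared mu, separate kappas; every raw sample enters the
        # objective directly, so no histogram or bin choice is involved.
        mu, k1, k2 = params
        return (- np.sum(vonmises.logpdf(data1, k1, loc=mu))
                - np.sum(vonmises.logpdf(data2, k2, loc=mu)))

    res = minimize(joint_nll, x0=[0.0, 1.0, 1.0],
                   bounds=[(-np.pi, np.pi), (1e-3, None), (1e-3, None)])
    print(res.x)  # [mu, kappa_1, kappa_2]

    With that intuition, the full symfit solution looks like this: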

    import numpy as np
    import matplotlib.pyplot as plt
    from symfit import (
        Fit, parameters, variables, exp, cos, CallableModel, pi, besseli
    )
    from symfit.core.objectives import LogLikelihood
    from symfit.core.printing import SymfitNumPyPrinter
    
    # The symbolic besseli is not converted into a numerical function yet;
    # this monkey-patches the printer to use scipy.special.iv.
    def _print_besseli(self, expr):
        return 'scipy.special.iv({}, {})'.format(*expr.args)
    
    SymfitNumPyPrinter._print_besseli = _print_besseli
    
    # creating the data
    mu1, mu2 = .05, -.05  # Are these supposed to be opposite sign?
    sigma1, sigma2 = 3.5, 2.5
    n1, n2 = 8000, 9000
    np.random.seed(42)
    x1 = np.random.vonmises(mu1, sigma1, n1)
    x2 = np.random.vonmises(mu2, sigma2, n2)
    
    # Create a model for `n` different datasets.
    n = 2
    x, *xs = variables('x,' + ','.join('x_{}'.format(i) for i in range(1, n + 1)))
    ys = variables(','.join('y_{}'.format(i) for i in range(1, n + 1)))
    mu, kappa = parameters('mu, kappa')
    kappas = parameters(','.join('k_{}'.format(i) for i in range(1, n + 1)),
                        min=0, max=10)
    mu.min, mu.max = - np.pi, np.pi  # Bound mu to the interval [-pi, pi]
    
    # Create a model template, whose symbols we will replace for each component.
    template = exp(kappa * cos(x - mu)) / (2 * pi * besseli(0, kappa))
    model = CallableModel(
        {y_i: template.subs({kappa: k_i, x: x_i}) for y_i, x_i, k_i in zip(ys, xs, kappas)}
    )
    print(model)
    
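    # Workaround: the stock LogLikelihood does not yet reduce multi-component
    # models correctly. super(LogLikelihood, self).__call__ deliberately skips
    # LogLikelihood's own __call__ and returns the raw evaluated model
    # components, which are then summed into a single scalar -log(L) below.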
    class AlfredosLogLikelihood(LogLikelihood):
        def __call__(self, *args, **kwargs):
            evaluated_func = super(LogLikelihood, self).__call__(
                *args, **kwargs
            )
    
            ans = - sum([np.nansum(np.log(component))
                         for component in evaluated_func])
            return ans
    
    fit = Fit(model, x_1=x1, x_2=x2, objective=AlfredosLogLikelihood)
    
    x_axis = np.linspace(- np.pi, np.pi, 101)
    fit_result = fit.execute()
    print(fit_result)
    x1_result, x2_result = model(x_1=x_axis, x_2=x_axis, **fit_result.params)
    
    # plotting
    plt.hist(x1, bins=40, density=True, histtype='step', color='#1f77b4')
    plt.hist(x2, bins=40, density=True, histtype='step', color='#ff7f0e')
    plt.plot(x_axis, x1_result, color='#1f77b4')
    plt.plot(x_axis, x2_result, color='#ff7f0e')
    plt.show()
    

    This outputs the following:

    [y_1(x_1; k_1, mu) = exp(k_1*cos(mu - x_1))/(2*pi*besseli(0, k_1)),
     y_2(x_2; k_2, mu) = exp(k_2*cos(mu - x_2))/(2*pi*besseli(0, k_2))]
    
    Parameter Value        Standard Deviation
    k_1       3.431673e+00 None
    k_2       2.475649e+00 None
    mu        1.030791e-02 None
    Status message         b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
    Number of iterations   13
    

    [Plot: histograms of both datasets with the jointly fitted von Mises curves overlaid.]
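
    As a quick sanity check, one can compare the shared-mu result against independent scipy fits of each dataset (a small sketch reusing x1 and x2 from above):

    import scipy.stats as st

    k1_hat, loc1_hat, _ = st.vonmises.fit(x1, fscale=1)
    k2_hat, loc2_hat, _ = st.vonmises.fit(x2, fscale=1)
    # The individual kappas should land close to k_1 and k_2 above,
    # and the two locs should straddle the shared mu.
    print(k1_hat, loc1_hat, k2_hat, loc2_hat)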

    I hope this gets you in the right direction, and thank you for bringing this bug to light ;).