
Issue fitting probability distribution with beta-binomial in SciPy 1.9.x: possible breaking change?


I updated the SciPy version in a project from 1.8.1 to 1.9.0 while keeping the NumPy version at 1.24.4. This change broke the project. The code snippet below reproduces the issue, which involves fitting the beta-binomial distribution from scipy.stats to observed data.

from collections import Counter

import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution


def func_nll(free_params, *args):
    """Negative log-likelihood used by the optimizer to fit the best available probability distribution.
    More information on the optimizer used:
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html

    Arguments:
        free_params: probability distribution parameters under tuning. Depending on the
            probability distribution to fit, there can be one or more of them.
        *args: fixed arguments: the distribution object, the observed data, and the
            shift (loc) of the distribution.

    Returns:
        Negative log-likelihood value used by the optimizer.
    """
    dist, x, shift = args
    # negative log-likelihood of the data under the candidate parameters
    negative_log_likelihood = -dist.logpmf(x, *free_params, loc=shift).sum()
    if np.isnan(negative_log_likelihood):  # occurs when x is outside of the support
        negative_log_likelihood = np.inf   # we don't want that
    return negative_log_likelihood


def fit(x, shift):
    # bounds for (n, a, b); note that `n` is read from the enclosing (global) scope
    bounds = [(0, n + 1), (1, 700), (1, 700)]
    dist = stats.betabinom
    opt_res = differential_evolution(func_nll, bounds, maxiter=100, tol=0.001, seed=45,
                                     args=(dist, x, shift))
    return opt_res


x = [28, 67, 69, 69, 71, 69, 55, 65, 68, 67, 67, 68, 69, 60, 40]
n = max(x)

mode_value = Counter(x).most_common(1)[0][1]
n_zero = 4288  # high number of zeroes
is_zero_inflated = (n_zero >= mode_value)

shift_loc = 0
if is_zero_inflated:
    shift_loc = 1

b_param = fit(x, shift=shift_loc)

print(b_param)

With scipy==1.8.1, the above code snippet prints:

     fun: 49.59657327021416
 message: 'Optimization terminated successfully.'
    nfev: 2802
     nit: 61
 success: True
       x: array([70.00205357,  6.28931939,  1.0005225 ])

When updating the SciPy version to 1.9.0, the same snippet prints:

     fun: inf
 message: 'Maximum number of iterations has been exceeded.'
    nfev: 9129
     nit: 100
 success: False
       x: array([ 45.83055307,   3.17350763, 596.7724981 ])

I'm wondering if there was a breaking change in the implementation of the beta-binomial distribution in SciPy 1.9.x that could explain this issue.


Solution

  • I'm wondering if there was a breaking change in the implementation of the beta-binomial distribution in SciPy 1.9.x that could explain this issue.

    Perhaps. Check the following in SciPy 1.8.x:

    print(stats.betabinom(10.5, 2, 2).pmf([1, 2]))
    # SciPy 1.8.x: finite values; recent SciPy: [nan nan]
    

    I get NaNs in recent versions, which is a bug fix. SciPy's beta-binomial distribution is defined only for integer n. If it produced real values for non-integer n before, that was probably an unintentional (and not mathematically sound) interpolation of some sort by the underlying library. I believe something like that was fixed in the timeframe of interest.

    In any case, it could have been a change in betabinom, differential_evolution, or even NumPy. For instance, NumPy doesn't guarantee bitwise reproducibility in their random number streams between major versions (e.g. 1.25 to 1.26), so SciPy can't guarantee exact reproducibility of stochastic algorithms like differential_evolution between major versions. (That said, SciPy also doesn't guarantee exact reproducibility in light of bug fixes and enhancements. There are guards against major regressions and backward incompatible API changes, but not that a complex algorithm like differential_evolution will always return precisely the same solution or even succeed in exactly the same space of problems.)
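    To see why this breaks the snippet above: func_nll converts every NaN into inf, and differential_evolution samples n as a float, so essentially every candidate yields an infinite objective and the search cannot converge (matching the fun: inf in your 1.9.0 output). A minimal sketch of that failure mode, assuming the post-fix NaN behavior:

    from scipy import stats

    # logpmf(k, n, a, b): finite for integer n, NaN for non-integer n on recent SciPy
    print(stats.betabinom.logpmf(5, 10, 2.0, 2.0))    # finite (roughly -2.07)
    print(stats.betabinom.logpmf(5, 10.5, 2.0, 2.0))  # nan, so func_nll returns inf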

    But scipy.stats.fit (new in 1.9.0) solves your problem. It uses differential_evolution to numerically minimize the negative log-likelihood, as you are doing, but it only passes integer n as a parameter, and it uses a penalized log-likelihood that stays well-behaved when NaNs or infs arise.

    
    from collections import Counter
    import numpy as np
    from scipy import stats
    
    x = [28, 67, 69, 69, 71, 69, 55, 65, 68, 67, 67, 68, 69, 60, 40]
    n = max(x)
    
    mode_value = Counter(x).most_common(1)[0][1]
    n_zero = 4288  # high number of zeroes
    is_zero_inflated = (n_zero >= mode_value)
    
    shift_loc = 0
    if is_zero_inflated:
        shift_loc = 1
    
    # new code begins here
    bounds = dict(n=(0, n+1), a=(1, 700), b=(1, 700), loc=(shift_loc, shift_loc))
    res = stats.fit(stats.betabinom, x, bounds=bounds)
    print(res)
    #  params: FitParams(n=70.0, a=6.282699443065473, b=1.0, loc=1.0)
    # success: True
    # message: 'Optimization terminated successfully.'
    
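    The fitted parameters live on the result object as a named tuple. As a quick follow-up sketch (the variable names here are just for illustration), you can unpack them to freeze the fitted distribution for further use:

    # res.params is a named tuple (n, a, b, loc)
    n_fit, a_fit, b_fit, loc_fit = res.params
    fitted = stats.betabinom(int(n_fit), a_fit, b_fit, loc=int(loc_fit))
    print(fitted.mean(), fitted.pmf(69))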

    Update based on a request in the comments: alternatively, to minimize changes to the original code snippet, call differential_evolution with its integrality parameter so that the first free parameter (n) is only ever sampled at integer values.

    opt_res = differential_evolution(func_nll, bounds, maxiter=100, tol=0.001, seed=45,
                                     integrality=[True, False, False], args=(dist, x, shift))
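    Note that integrality was itself added to differential_evolution in SciPy 1.9.0 (if I remember the release correctly), so this option requires the very version you are upgrading to.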