Search code examples
pythonscipystatisticsconfidence-intervalscipy.stats

Confidence intervals with scipy


I have an array of shape (n, timesteps), where n is the number of trials and timesteps is the length of each trial. Each value of this array denotes a stochastic measurement.
I would like to implement a generic function that computes a confidence interval for a given statistic (mean, median, ...) assuming 1) the underlying distribution is normal, 2) is Student's.

Something like

def normal_ci(
    data: np.array,
    axis: int = 0,
    statistic: Callable = np.mean,
    confidence: float = 0.95
):

and a similar function student_ci().

My problem is that, by default, scipy functions compute intervals for the mean, am I right? Like in this answer.


Solution

  • This is on Stack Overflow, so the computational answer is to use the bootstrap.

    from typing import Callable
    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(84354894298246)
    
    def normal_ci(
        data: np.array,
        axis: int = 0,
        statistic: Callable = np.mean,
        confidence: float = 0.95
    ):
        res = stats.bootstrap((data,), statistic, axis=axis,
                              confidence_level=confidence)
        return tuple(res.confidence_interval)
    
    # generate 1000 samples, each of length 100
    data = rng.standard_normal(size=(1000, 100))
    
    # compute the confidence interval for each
    low, high = normal_ci(data, axis=-1)
    
    # the confidence interval contains the population value
    # of the statistic in 95% of cases
    np.mean((low < 0) & (0 < high))  # 0.953
    

    Since you know the families from which your data are drawn, you could look into the parametric bootstrap.

    def normal_ci(
        data: np.array,
        axis: int = 0,
        statistic: Callable = np.mean,
        confidence: float = 0.95
    ):
        # fit a normal distribution to the data
        mean = np.mean(data, axis=axis)
        std = np.std(data, ddof=1, axis=axis)
    
        # resample data from the fitted normal distribution
        n_resamples = 999
        m = data.shape[axis]  # assuming axis is an integer
        resample = rng.normal(loc=mean, scale=std, size=(m, n_resamples) + mean.shape)
    
        # compute the statistic for each of the resamples to estimate
        # the distribution
        statistic_distribution = statistic(resample, axis=0)
    
        # Generate the confidence interval
        # percentile bootstrap (very crude)
        alpha = 1 - confidence
        low, high = np.quantile(statistic_distribution, [alpha/2, 1-alpha/2], axis=0)
        return low, high
    
    low, high = normal_ci(data, axis=-1)
    np.mean((low < 0) & (0 < high))  # 0.954
    

    SciPy's bootstrap function uses the BCa method by default, which is reasonably sophisticated; this parametric bootstrap is what I could write in a few minutes. So for parametric bootstrap, you really should research what other libraries are out there.

    If you're not happy with the bootstrap, you'd need to do some research: for each of the statistics and population families you're interested in, you need to know the distribution of the statistic of a sample drawn from your population. For a sample from a normal distribution, the sample mean is t-distributed, the variance is chi-squared distributed, etc... and that is not a question for Stack Overflow.