python numpy scipy statistics confidence-interval

Compute a confidence interval from sample data assuming unknown distribution


I have sample data for which I would like to compute a confidence interval, assuming a distribution that is not normal and is unknown. Judging by the histogram, the distribution looks like a Pareto, but I don't know for sure.

[Image: distribution histogram]

The existing answers only cover the normal distribution:

Compute a confidence interval from sample data

Correct way to obtain confidence interval with scipy


Solution

  • If you don't know the underlying distribution, then my first thought would be to use bootstrapping: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

    In Python, assuming x is a one-dimensional numpy array containing your data:

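    import numpy as np

    rng = np.random.default_rng()  # NumPy's recommended random generator
    N = 10000  # number of bootstrap resamples
    mean_estimates = []
    for _ in range(N):
        # Draw len(x) indices with replacement, then average the resampled data
        re_sample_idx = rng.integers(0, len(x), size=len(x))
        mean_estimates.append(np.mean(x[re_sample_idx]))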

    mean_estimates is now a list of 10000 bootstrap estimates of the mean. Take the 2.5th and 97.5th percentiles of these 10000 values, and you have a 95% confidence interval for the mean of your data:

    # The 2.5th and 97.5th percentiles of the bootstrap means bound the 95% interval
    conf_interval = np.percentile(mean_estimates, [2.5, 97.5])
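
The steps above can be wrapped into one reusable function. This is a minimal sketch of the percentile bootstrap, not a definitive implementation: the name bootstrap_ci and its parameters (statistic, n_resamples, confidence, seed) are made up for illustration, and the Pareto-like sample is synthetic data standing in for the real measurements.

```python
import numpy as np

def bootstrap_ci(x, statistic=np.mean, n_resamples=10_000, confidence=0.95, seed=None):
    """Percentile-bootstrap confidence interval for `statistic` of 1-D data `x`.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Draw len(x) observations with replacement and recompute the statistic.
        resample = rng.choice(x, size=x.size, replace=True)
        estimates[i] = statistic(resample)
    alpha = 1.0 - confidence
    # Cut alpha/2 of the bootstrap estimates off each tail.
    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Synthetic heavy-tailed data standing in for the real sample (an assumption).
rng = np.random.default_rng(0)
sample = rng.pareto(3.0, size=500)

low, high = bootstrap_ci(sample, n_resamples=2000, seed=1)
print(f"mean = {sample.mean():.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Passing a different statistic, for example bootstrap_ci(sample, statistic=np.median), gives an interval for the median instead. That may be the more stable choice if the tail is very heavy, since the bootstrap of the mean can behave poorly when the variance is extremely large. SciPy 1.7 and later also ship scipy.stats.bootstrap, which may be worth checking before rolling your own.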