Search code examples
python-3.xhypothesis-testscipy.statsbinomial-cdf

hypothesis test success rates between two samples with binomial distributions


I'm trying to compare success rates in two samples. I'm treating each sample as a binomial distribution, then trying to compare the success rates with a two tailed hypothesis test. (x1 is the number of successes in sample 1, and n1 is the total number of trials, similarly for sample 2.) I believe the code I have below is very similar to the process described in this post

https://stats.stackexchange.com/questions/113602/test-if-two-binomial-distributions-are-statistically-different-from-each-other

for performing a hypothesis test for success rate between two groups. however the pVal I'm getting is 1.82372486268966, which doesn't make sense. is there something I'm misunderstanding about the theory for this hypothesis test, or is there an error in my code below?

code:

import math
import scipy.stats as stats

x1=195
x2=5481
n1=135779
n2=81530


hatP = (n1*(x1/n1) + n2*(x2/n2))/(n1 + n2)
hatQ = 1 - hatP
hatP1 = x1/n1
hatP2 = x1/n2
Z = (hatP1 - hatP2)/(math.sqrt(hatP*hatQ*(1/n1 + 1/n2)))
pVal = 2*(1 - stats.norm.cdf(Z))


Z: -1.3523132192521408

pVal: 1.82372486268966

Solution

  • After correcting the hatP2 = x1/n2 to hatP1 = x2/n2, your statistic is approximately normally distributed under the null hypothesis that the two samples were drawn from the same binomial distribution.

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    
    def statistic(x1, n1, x2, n2):
      hatP = (n1*(x1/n1) + n2*(x2/n2))/(n1 + n2)
      hatQ = 1 - hatP
      hatP1 = x1/n1
      hatP2 = x2/n2  # corrected from x1/n2
      Z = (hatP1 - hatP2)/(np.sqrt(hatP*hatQ*(1/n1 + 1/n2)))
      return Z
    
    n1=135779
    n2=81530
    n_trials = 100000
    proportion_h0 = 0.001  # for example
    
    x1 = stats.binom(n=n1, p=proportion_h0).rvs(size=n_trials)
    x2 = stats.binom(n=n2, p=proportion_h0).rvs(size=n_trials)
    Z = statistic(x1, n1, x2, n2)
    
    plt.hist(Z, density=True, bins=50, label='histogram')
    plt.xlabel('statistic')
    plt.ylabel('frequency density')
    plt.title('Histogram of statistic under H0')
    
    x = np.linspace(-5, 5, 300)
    plt.plot(x, stats.norm.pdf(x), label='standard normal PDF')
    plt.legend()
    

    Histogram of statistic under H0

    The calculation of the p-value depends on the alternative hypothesis. It looks like you are interested in the two-sided null hypothesis that the samples were not drawn from the same binomial distribution, in which case the p-value would be calculated as:

    x1 = 195
    x2 = 5481
    Z = statistic(x1, n1, x2, n2)
    # twice the smaller of the one-sided pvalues
    pVal = 2*np.minimum(stats.norm.cdf(Z), stats.norm.sf(Z))  # ~0
    

    In your case, the p-value underflows; it is too small to be represented in double precision.

    Rather than writing a custom test, consider using scipy.stats.fisher_exact, scipy.stats.barnard_exact, or scipt.stats.boschloo_exact.