python math scipy statistics

Threshold for chi2.cdf() below which computing the p-value isn't worth the computational resources


I need to run a chi-square test on my dataset to find the p-value. The obvious choice is to use chi2_contingency() and chi2.cdf() from scipy.stats. But the p-value (5.723076338262742e-82) is so tiny that it takes 3 seconds to compute, even for this simple dataset. I want to avoid this slow step by setting a custom threshold for chi2.cdf(). If the p-value is much smaller than 0.01, I don't think it's worth the computational effort to calculate it exactly.

My example dataset is:

import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([[150, 700], [350, 150]])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(observed)

# Print the results
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
# Results
Chi2 Statistic: 367.7704987889273
P-value: 5.723076338262742e-82
Degrees of Freedom: 1
Expected Frequencies:
[[314.81481481 535.18518519]
 [185.18518519 314.81481481]]

My approach

I tried to bypass the computation, but this approach still compares the p-value with the threshold a posteriori, that is, only after chi2_contingency() has already computed it.

import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed data
observed = np.array([[150, 700], [350, 150]])

# Perform the chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Set your threshold (for example, 0.01)
threshold = 0.01

# Check if p-value is below the threshold
if p_value < threshold:
    print(f'P-value is extremely small (<{threshold}). Skipping the exhaustive computation.')
else:
    # Compute the actual p-value
    p_value = 1 - chi2.cdf(chi2_stat, dof)
    print(f'P-value: {p_value}')

Conclusion

To wrap it up, I'm looking for a programmatic way to skip the p-value calculation and run it only when the p-value would be >= 0.01. Looking forward to your input!


Solution

  • You can implement the calculation of the statistic yourself to avoid having chi2_contingency perform the p-value calculation, but I don't think it's worth your time because chi2_contingency(observed) takes less than half of a millisecond on Google Colab for your data.

    %timeit chi2_contingency(observed)
    # 415 µs ± 30.8 µs per loop (mean ± std.)
    

    Calculating the p-value itself accounts for a fraction of that, and it will not depend noticeably on the values. The distribution infrastructure has a ton of overhead; the underlying special function call is only a few microseconds (and even that is mostly data-independent overhead).

    I imagine the time you are observing is really the imports or the statistic calculation (if your real data is different from the example you've given here), but if chi2_contingency is really so slow on your machine, submit a bug report with SciPy (https://github.com/scipy/scipy/issues).
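
For completeness, here is a minimal sketch of the "implement the statistic yourself" idea mentioned above. The helper name chi2_test_with_cutoff is hypothetical, not part of SciPy. It assumes a 2-D table with at least two rows and two columns and nonzero expected frequencies, and it mirrors the Yates continuity correction that chi2_contingency applies by default when there is one degree of freedom. Instead of comparing p-values, it compares the statistic with the critical value chi2.isf(threshold, dof), so the p-value is computed only when it would be >= threshold.

```python
import numpy as np
from scipy.stats import chi2
from scipy.stats.contingency import expected_freq


def chi2_test_with_cutoff(observed, threshold=0.01):
    """Hypothetical helper: return (statistic, dof, p_value_or_None).

    The p-value is computed only when it would be >= threshold.
    Assumes a 2-D table (>= 2 rows and columns, nonzero expected counts).
    """
    observed = np.asarray(observed, dtype=float)
    expected = expected_freq(observed)
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

    adjusted = observed
    if dof == 1:
        # Yates' continuity correction (the chi2_contingency default for
        # dof == 1): move each count by at most 0.5 toward its expectation.
        diff = expected - observed
        adjusted = observed + np.sign(diff) * np.minimum(0.5, np.abs(diff))

    statistic = float(((adjusted - expected) ** 2 / expected).sum())

    # Statistic value whose upper-tail probability equals the threshold.
    critical = chi2.isf(threshold, dof)
    if statistic > critical:
        # The p-value is below the threshold, so skip computing it.
        return statistic, dof, None

    # sf is the upper-tail probability; it avoids the rounding loss of
    # 1 - cdf when the p-value is small.
    return statistic, dof, float(chi2.sf(statistic, dof))


stat, dof, p = chi2_test_with_cutoff([[150, 700], [350, 150]])
print(f"Chi2 Statistic: {stat}, Degrees of Freedom: {dof}, P-value: {p}")
```

If many tables share the same threshold and degrees of freedom, the critical value could be computed once and reused. Given the timings above, though, this is unlikely to save a measurable amount of time.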