Tags: python, optimization, t-test

How do you efficiently perform millions of t-tests in Python?


Long story short, I need to perform several hundred million t-tests. I have two lists of samples, ys and ns, and I want to compare the samples pairwise: the first sample in ys is compared to the first sample in ns, and so on. The result will be a list of p-values, one per comparison. What is the fastest way to do this? Currently I am using the map function

from scipy.stats import ttest_ind

p_values = [result[1] for result in map(ttest_ind, ys, ns)]

but it is still slow. numpy.vectorize looks like it might be faster, but I can't figure out how to use it with a function that takes two lists as input. Would it be faster if I hard-coded the t-test math instead of using scipy.stats.ttest_ind?


Solution

  • The whole idea is to avoid running this in Python and run it in C/C++ instead.

    For that you have two choices:

    1. Write it in C/C++ yourself and connect it to Python.
    2. Work with C/C++-backed libraries like NumPy: pack your data into NumPy arrays and operate on them with NumPy functions. The backend then runs in C/C++, the same as option 1, but it is much easier; see the sketches after this list.
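
    For this particular problem you may not need to hand-code anything: scipy.stats.ttest_ind accepts array arguments and an axis parameter, so if all of your samples have the same length you can stack each list into a 2-D array and run every test in a single call, with the loop executing inside SciPy's compiled code. A minimal sketch, assuming equal-length samples; the shapes and the random placeholder data are illustrative only:

    import numpy as np
    from scipy.stats import ttest_ind

    # Placeholder data: one row per test, one column per observation.
    # Replace with your own stacked samples; this assumes every sample
    # in ys (and in ns) has the same length, so the arrays are rectangular.
    rng = np.random.default_rng(0)
    ys = rng.normal(size=(1_000_000, 20))
    ns = rng.normal(size=(1_000_000, 20))

    # A single call runs all one million t-tests along axis 1.
    result = ttest_ind(ys, ns, axis=1)
    p_values = result.pvalue  # array of shape (1_000_000,)

    If you do want to hard-code the math (for example, to skip SciPy's per-call input validation), the standard two-sample equal-variance t-test vectorizes cleanly in NumPy; only the final p-value lookup needs SciPy's t distribution. A sketch under the same rectangular-array assumption; vectorized_ttest is a hypothetical helper name:

    import numpy as np
    from scipy.stats import t as t_dist

    def vectorized_ttest(a, b):
        """Two-sided, equal-variance t-test of each row of a against the
        matching row of b. Returns one p-value per row."""
        n1, n2 = a.shape[1], b.shape[1]
        # Sample variances with Bessel's correction, then the pooled variance.
        s1 = a.var(axis=1, ddof=1)
        s2 = b.var(axis=1, ddof=1)
        pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
        t_stat = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(pooled * (1 / n1 + 1 / n2))
        # Two-sided p-value from the survival function of the t distribution.
        return 2 * t_dist.sf(np.abs(t_stat), n1 + n2 - 2)

    p_values = vectorized_ttest(ys, ns)

    Both sketches compute the same statistic as the per-pair ttest_ind calls (which default to the equal-variance test), so on equal-length samples the p-values should match.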