I am using the ztest built-in function from statsmodels to run a single hypothesis test. However, if I want to run many separate hypothesis tests on many different columns, to test say the difference between two medians or two means, then doing it one by one becomes cumbersome. Is there a faster and more efficient way (memory- and time-wise) to run n of these tests?

To be more specific, say we have a dataframe of n columns, and I want to test the difference between the mean or median return of certain trading days (or a sequence of them) for a certain ticker versus the overall mean of that ticker over some period of time, say 5 years with daily values. In the standard case, one would use
from statsmodels.stats.weightstats import ztest
ztest_Score, p_value = ztest(df_alternative['symbol is here'], df_null, alternative='two-sided')
where of course df_null above is a scalar quantity (say, the daily average return for the entire period), and df_alternative is a column within a larger dataframe of tickers that holds the mean or median of your sequence of trading days. How can one do this iterative procedure, ideally in one line of code, so that it goes over each of these separate columns within my dataframe and the corresponding mean or median value, and decides for each column whether the null hypothesis is rejected or not?
Best regards
First, the one-sample hypothesis test is vectorized. Here I assume the value under the null is 0:
import numpy as np
from statsmodels.stats.weightstats import ztest

x = np.random.randn(100, 4)
ztest_Score, p_value = ztest(x, value=0, alternative='two-sided')
ztest_Score, p_value
(array([1.69925429, 0.5359994 , 0.05777533, 0.78699997]),
array([0.08927128, 0.59195896, 0.95392759, 0.43128188]))
Looping over the columns one by one gives the same results:

[ztest(x[:, i], value=0, alternative='two-sided') for i in range(x.shape[1])]
[(1.699254292717283, 0.0892712806133958),
(0.5359994032597257, 0.5919589628688362),
(0.057775326408478586, 0.953927592014832),
(0.7869999680163862, 0.43128188488265284)]
Second, the two-sample test is vectorized with appropriate numpy broadcasting. The following compares each column of the first sample x to the second sample y:
y = np.random.randn(100)
statistic, p_value = ztest(x, y, alternative='two-sided')
statistic, p_value
(array([1.36445473, 0.50622444, 0.15362677, 0.64741684]),
array([0.17242449, 0.6126991 , 0.87790403, 0.5173622 ]))
Again, the column-by-column loop agrees:

[ztest(x[:, i], y, alternative='two-sided') for i in range(x.shape[1])]
[(1.364454734896, 0.17242449122265047),
(0.5062244362943313, 0.6126991023616855),
(0.15362676881725684, 0.8779040290306083),
(0.6474168385742498, 0.5173622008385331)]
Reshaping y into a column vector gives the same result, since broadcasting then aligns it against each column of x:

statistic, p_value = ztest(x, y[:, None], alternative='two-sided')
statistic, p_value
(array([1.36445473, 0.50622444, 0.15362677, 0.64741684]),
array([0.17242449, 0.6126991 , 0.87790403, 0.5173622 ]))
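To see why the `y[:, None]` reshape is valid, here is a minimal sketch of the underlying NumPy broadcasting rule (plain arrays, no statsmodels involved): a `(100, 1)` column aligns elementwise against each column of a `(100, 4)` array, whereas a flat `(100,)` array would not broadcast along that axis.

```python
import numpy as np

x = np.zeros((100, 4))
y = np.zeros(100)

# y[:, None] has shape (100, 1); broadcasting pairs it with
# every column of x, producing a (100, 4) result
diff = x - y[:, None]
assert diff.shape == (100, 4)
```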
Regarding the case in the question:
The two-sample test cannot be used when one of the samples has only a single observation. The ztest needs to compute the variance of the samples in order to compute inferential statistics like p-values. Specifically, the ztest (or ttest) needs the standard error of the mean estimate of both samples, which depends on the sample sizes. If a sample has only a single observation, then the pooled variance is used, but the standard error of the mean will be very large.
So, the options are either to use the one-sample z-test, which assumes that the second "mean" has no uncertainty, or to use the two-sample test with the full data series as the second sample, which will compute the standard error of its mean from the sample.
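Putting this together for the dataframe setup in the question, a sketch of the one-sample route follows. The ticker names, the random data, and the every-fifth-day subset are all hypothetical stand-ins for the asker's real returns; each column's subset is tested against that column's overall mean, which is treated as a fixed value under the null:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)

# hypothetical daily returns: ~5 years (1260 trading days) for 4 tickers
df = pd.DataFrame(rng.normal(0, 0.01, size=(1260, 4)),
                  columns=["AAA", "BBB", "CCC", "DDD"])

# hypothetical subset of trading days of interest (here: every 5th day)
subset = df.iloc[::5]

# one-sample z-test per column: subset mean vs. the column's overall
# mean over the full period, the latter assumed known under the null
results = {col: ztest(subset[col], value=df[col].mean(),
                      alternative='two-sided')
           for col in df.columns}

for col, (stat, p) in results.items():
    print(col, stat, p)
```

This is still a loop over columns, but the loop is over n short ztest calls rather than n manual setups, and it keeps each column paired with its own null value.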