Z-Score as measurement of diverging values


I've been trying to use the z-score to filter out outliers in Python. I computed it both with the function scipy offers and by hand with numpy's mean and std functions; the results are the same. I thought z-scores between -1 and 1 should cover about 68.3% of the samples, or maybe I've got the concept wrong and the proportion depends on the values themselves.

However, here is an example where I'd expect an output closer to 0.683, not 0.57.

import numpy
from scipy import stats

arr = numpy.array(range(1, 1000))

col_z_score = stats.zscore(arr)

print((~numpy.bitwise_or(-1 >= col_z_score, 1 <= col_z_score)).sum() / len(col_z_score))
print((numpy.bitwise_and(1 >= col_z_score, -1 <= col_z_score)).sum() / len(col_z_score))

Solution

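  • The 68% rule (about 68.27% of values within one standard deviation of the mean) only holds for normally distributed data.

    arr = np.array(range(1, 1000)) is evenly spaced, so it is uniformly distributed. For a uniform distribution the fraction within one standard deviation of the mean is 1/sqrt(3) ≈ 57.7%, hence the 57%.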

    To generate normally distributed samples (mean 0, standard deviation 1) you can use this:

    import numpy as np

    arr = np.random.normal(0, 1, 1000)

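    Also, bitwise_or and bitwise_and happen to give the same result as logical_or and logical_and on boolean arrays, so they are not the cause of the 0.57, but the logical versions state the intent more clearly: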

    within_range = np.logical_and(col_z_score >= -1, col_z_score <= 1)
    
    proportion_within_range = within_range.sum() / len(col_z_score)
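
To check the explanation above side by side, here is a minimal sketch that computes the proportion for both the evenly spaced array and a normal sample. The helper name `proportion_within_one_sigma`, the seed, and the sample size of 100,000 are illustrative choices, not part of the original question.

```python
import numpy as np
from scipy import stats


def proportion_within_one_sigma(values):
    """Fraction of values whose z-score lies in [-1, 1] (hypothetical helper)."""
    z = stats.zscore(values)
    return np.mean(np.abs(z) <= 1)


# Seed and sample size are arbitrary; they only make the run repeatable.
rng = np.random.default_rng(seed=0)

uniform_arr = np.arange(1, 1000)                        # evenly spaced, i.e. uniform
normal_arr = rng.normal(loc=0, scale=1, size=100_000)   # normally distributed

p_uniform = proportion_within_one_sigma(uniform_arr)
p_normal = proportion_within_one_sigma(normal_arr)

print(f"uniform: {p_uniform:.3f}")  # expected near 1/sqrt(3), about 0.577
print(f"normal:  {p_normal:.3f}")   # expected near 0.683

# On boolean arrays the bitwise and logical functions agree element-wise.
z = stats.zscore(uniform_arr)
same_mask = np.array_equal(
    np.bitwise_and(z >= -1, z <= 1),
    np.logical_and(z >= -1, z <= 1),
)
print(same_mask)
```

Using `np.abs(z) <= 1` is just a compact way of writing the two-sided comparison; it selects the same elements as the `logical_and` expression.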