I have a pandas DataFrame with a couple of columns.
I calculated the z-score for one of the columns from its mean and standard deviation.
Now I would like to know which distribution the data follows, based on the z-score. From the histogram I can tell it is a normal distribution.
Is there a programmatic way to tell the distribution type from the z-score?
I'm new to statistics, so maybe I'm missing something very simple.
Sample code:
df[col_zscore] = (df[column] - df[column].mean())/df[column].std(ddof=0)
If the distribution is normal, then by the 68–95–99.7 rule about 68% of df[col_zscore] lies between -1 and 1, about 95% between -2 and 2, and about 99.7% between -3 and 3. At the other extreme, if the column holds a single fixed value, its standard deviation is zero, so every z-score is 0/0 and comes out as NaN rather than a finite number.
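As a sanity check on where those percentages come from, here is a minimal sketch that computes the theoretical coverage of a standard normal within k standard deviations, using P(|Z| < k) = erf(k / sqrt(2)). The helper name normal_coverage is made up for illustration.

```python
import math


def normal_coverage(k):
    """Fraction of a standard normal distribution within +/- k standard deviations."""
    # P(|Z| < k) = erf(k / sqrt(2)) for Z ~ N(0, 1)
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print("within +/-{0} sigma: {1:.2f}%".format(k, normal_coverage(k) * 100))
```

The exact values are about 68.27%, 95.45% and 99.73%, which the rule rounds to 68, 95 and 99.7.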
You can check if it is close to normal or a fixed value by the following function:
import math

import numpy as np


def three_sigma_rule(zscores):
    zscores = zscores.tolist()  # accepts a pandas Series or a NumPy array
    n = len(zscores)
    # Percentage of z-scores strictly inside each band (NaN never matches).
    one_sigma = len([z for z in zscores if -1 < z < 1]) / n * 100
    two_sigma = len([z for z in zscores if -2 < z < 2]) / n * 100
    three_sigma = len([z for z in zscores if -3 < z < 3]) / n * 100
    print("Percentage of the z-score between -1 to 1: {0}%".format(one_sigma))
    print("Percentage of the z-score between -2 to 2: {0}%".format(two_sigma))
    print("Percentage of the z-score between -3 to 3: {0}%".format(three_sigma))
    # Compare with the 68-95-99.7 rule, allowing a 10% relative tolerance.
    condition1 = math.isclose(one_sigma, 68, rel_tol=0.1)
    condition2 = math.isclose(two_sigma, 95, rel_tol=0.1)
    condition3 = math.isclose(three_sigma, 99.7, rel_tol=0.1)
    # A constant column has zero standard deviation, so all its z-scores are NaN.
    condition4 = np.isnan(zscores).all()
    if condition1 and condition2 and condition3:
        print("It is normal distribution.")
    if condition4:
        print("It is fixed value.")
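One caveat with rel_tol=0.1: the tolerance is relative, so the 99.7% check accepts anything from roughly 90% upward. A possible stricter variant, sketched here with only the standard library, returns the three percentages and compares them with the theoretical values using an absolute tolerance. The names sigma_percentages, looks_normal and zscore, and the default tolerance of 1 percentage point, are assumptions for illustration and not part of the function above; a formal normality test would be a more rigorous option.

```python
import math
import random

EXPECTED = (68.27, 95.45, 99.73)  # theoretical normal coverage in percent


def sigma_percentages(zscores):
    """Return the percentage of z-scores strictly inside +/-1, +/-2 and +/-3."""
    n = len(zscores)
    return tuple(
        sum(1 for z in zscores if -k < z < k) / n * 100
        for k in (1, 2, 3)
    )


def looks_normal(zscores, abs_tol=1.0):
    """True if all three percentages are within abs_tol points of EXPECTED."""
    observed = sigma_percentages(zscores)
    return all(
        math.isclose(obs, exp, rel_tol=0.0, abs_tol=abs_tol)
        for obs, exp in zip(observed, EXPECTED)
    )


def zscore(values):
    """Population z-scores (ddof=0); assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


random.seed(0)  # fixed seed so the sketch is repeatable
normal_z = zscore([random.gauss(5, 3) for _ in range(100000)])
uniform_z = zscore([random.uniform(-100, 10000) for _ in range(100000)])
print(looks_normal(normal_z), looks_normal(uniform_z))
```

With samples this large, the normal sample should pass and the uniform one should fail, since only about 58% of a uniform sample falls within one standard deviation of its mean.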
Let's generate some random numbers:
if __name__ == "__main__":
    import pandas as pd
    import numpy as np

    n = 100000
    df = pd.DataFrame(dict(
        a=np.random.normal(5, 3, size=n),                   # normal
        b=np.random.uniform(low=-100, high=10000, size=n),  # uniform
        c=np.random.uniform(low=5, high=5, size=n),         # constant: every value is 5
    ))
    # Population z-scores (ddof=0); column c has zero std, so its z-scores are NaN.
    df['a_zscore'] = (df['a'] - df['a'].mean())/df['a'].std(ddof=0)
    df['b_zscore'] = (df['b'] - df['b'].mean())/df['b'].std(ddof=0)
    df['c_zscore'] = (df['c'] - df['c'].mean())/df['c'].std(ddof=0)
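A side note on column c: every comparison involving NaN is False, so NaN z-scores are counted in none of the three bands, and three_sigma_rule would report 0% for each band before its isnan check fires. A tiny standard-library sketch of that behaviour, using a made-up five-element list as a stand-in for df['c_zscore']:

```python
import math

nan_zscores = [float("nan")] * 5  # stand-in for the z-scores of a constant column

# Chained comparisons with NaN are always False, so no value lands in the band.
inside_one_sigma = sum(1 for z in nan_zscores if -1 < z < 1)
all_nan = all(math.isnan(z) for z in nan_zscores)

print(inside_one_sigma, all_nan)
```

This is why the fixed-value case has to be detected with an explicit NaN check instead of through the percentages.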
Output of three_sigma_rule(df['a_zscore']):