
Pandas Creating Normal Dist series


I'm trying to convert an Excel "normal distribution" formula into Python.

(1-NORM.DIST(a+col,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

For example, here is my DataFrame (df):

Id   a  b    c    
ijk  4  3.5  12.53
xyz  12 3    10.74

My goal:

Id   a  b    c      0    1    2    3
ijk  4  3.5  12.53  1   .93  .87  .81
xyz  12 3    10.74  1   .87  .76  .66

Here's the math behind it:

column 0: always 1

column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

This is what I have so far:

df1 = pd.DataFrame(df, columns=np.arange(0,4))
result = pd.concat([df, df1], axis=1, join_axes=[df.index])
result[0] = 1

I'm not sure what to do after this.

This is how I use the normal distribution function: https://support.office.com/en-us/article/normdist-function-126db625-c53e-4591-9a22-c9ff422d6d58

Many many thanks!


Solution

  • NORM.DIST(..., TRUE) is the cumulative distribution function (CDF), and 1 - NORM.DIST(..., TRUE) is the survival function. Both are available in SciPy's stats module (see ss.norm). For example,

    import scipy.stats as ss
    ss.norm.cdf(4, 3.5, 12.53)
    Out:
    0.51591526057026538
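
As a quick sanity check of the mapping described above, the sketch below compares Excel's cumulative form with SciPy's cdf and sf. The helper name excel_norm_dist_cumulative is hypothetical and used only for illustration; the argument order (x, mean, standard deviation) is assumed to match NORM.DIST.

```python
import scipy.stats as ss


def excel_norm_dist_cumulative(x, mean, sd):
    """Hypothetical helper mirroring Excel's NORM.DIST(x, mean, sd, TRUE)."""
    return ss.norm.cdf(x, loc=mean, scale=sd)


# Same inputs as the example above: x=4, mean=3.5, standard deviation=12.53
cdf_value = excel_norm_dist_cumulative(4, 3.5, 12.53)

# The survival function is 1 - CDF, computed directly by SciPy
sf_value = ss.norm.sf(4, loc=3.5, scale=12.53)

print(cdf_value)
print(sf_value)
print(abs((1 - cdf_value) - sf_value) < 1e-12)
```

Calling sf directly rather than writing 1 - cdf is generally preferable far out in the upper tail, where the subtraction loses precision.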
    

    For your case, you can first define a function:

    def normalize(a, b, c, col):
        return ss.norm.sf(a+col, b, c) / ss.norm.sf(a, b, c)
    

    and call that function with apply:

    for col in range(4):
        df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)
    
    df
    Out: 
        Id   a    b      c    0         1         2         3
    0  ijk   4  3.5  12.53  1.0  0.934455  0.869533  0.805636
    1  xyz  12  3.0  10.74  1.0  0.875050  0.760469  0.656303
    

    This is not the most efficient approach: it recomputes the survival function of a (the denominator) for every column, and it nests two loops (one over columns, one over rows via apply). One level of looping can be removed by passing an array of values to ss.norm.sf:

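    import numpy as np
    import pandas as pd

    # One survival-function call per row covers all four offsets at once
    out = df.apply(
        lambda x: pd.Series(
            ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
        ),
        axis=1,
    )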
    
    Out: 
         0         1         2         3
    0  1.0  0.934455  0.869533  0.805636
    1  1.0  0.875050  0.760469  0.656303
    

    And you can use join to add this to your original DataFrame:

    df.join(out)
    Out: 
        Id   a    b      c    0         1         2         3
    0  ijk   4  3.5  12.53  1.0  0.934455  0.869533  0.805636
    1  xyz  12  3.0  10.74  1.0  0.875050  0.760469  0.656303
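
The remaining row-wise apply can probably be dropped as well. The following is a minimal sketch, not part of the original answer, that relies on NumPy broadcasting: each parameter column is reshaped to (n_rows, 1) so that adding the offsets 0..3 yields an (n_rows, 4) grid. The DataFrame is rebuilt from the question's sample data, and the names offsets, ratios and result are chosen here for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Sample data copied from the question
df = pd.DataFrame(
    {"Id": ["ijk", "xyz"], "a": [4, 12], "b": [3.5, 3.0], "c": [12.53, 10.74]}
)

offsets = np.arange(4)  # column labels 0, 1, 2, 3

# Column vectors of shape (n_rows, 1) broadcast against offsets of shape (4,)
a = df["a"].to_numpy()[:, None]
b = df["b"].to_numpy()[:, None]
c = df["c"].to_numpy()[:, None]

# Numerator has shape (n_rows, 4); the denominator (n_rows, 1) is computed once per row
ratios = ss.norm.sf(a + offsets, b, c) / ss.norm.sf(a, b, c)

out = pd.DataFrame(ratios, index=df.index, columns=offsets)
result = df.join(out)
print(result)
```

For two rows the difference is negligible; the broadcasting form mainly helps on large frames, where apply with axis=1 calls a Python function once per row.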