Pandas Creating Normal Dist series

I'm trying to convert an Excel "normal distribution" formula into Python.


For example, here's my df:

Id   a  b    c    
ijk  4  3.5  12.53
xyz  12 3    10.74

My goal:

Id   a  b    c      0    1    2    3
ijk  4  3.5  12.53  1   .93  .87  .81
xyz  12 3    10.74  1   .87  .76  .66

Here's the math behind it:

column 0: always 1

column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))

This is what I have so far:

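import numpy as np
import pandas as pd

df1 = pd.DataFrame(df, columns=np.arange(0,4))
# join_axes was removed in pandas 1.0; concat along axis=1 already aligns on df's index here
result = pd.concat([df, df1], axis=1)
result[0] = 1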

I'm not sure what to do after this.

This is how I use the normal distribution function:

Many many thanks!


  • NORM.DIST(..., TRUE) is the cumulative distribution function, and 1 - NORM.DIST(..., TRUE) is the survival function. Both are available in scipy's stats module (see ss.norm). For example,

    import scipy.stats as ss
    ss.norm.cdf(4, 3.5, 12.53)
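
As a quick check of that correspondence, here is a minimal sketch using the first row of the example. It assumes b is the mean and c the standard deviation, matching the argument order of Excel's NORM.DIST; the variable names are illustrative only.

```python
import numpy as np
import scipy.stats as ss

# First row of the example: value, mean, standard deviation (assumed roles).
a, b, c = 4, 3.5, 12.53

cdf = ss.norm.cdf(a, loc=b, scale=c)  # Excel: NORM.DIST(a, b, c, TRUE)
sf = ss.norm.sf(a, loc=b, scale=c)    # Excel: 1 - NORM.DIST(a, b, c, TRUE)

# The survival function is the complement of the CDF.
assert np.isclose(sf, 1 - cdf)
```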

    For your case, you can first define a function:

    def normalize(a, b, c, col):
        return ss.norm.sf(a+col, b, c) / ss.norm.sf(a, b, c)

    and call that function with apply:

    for col in range(4):
        df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)

    df then looks like this:

        Id   a    b      c    0         1         2         3
    0  ijk   4  3.5  12.53  1.0  0.934455  0.869533  0.805636
    1  xyz  12  3.0  10.74  1.0  0.875050  0.760469  0.656303

    This is not the most efficient approach: it recomputes the survival function at a for every column, and it involves two levels of looping (the for loop and apply). One level can be removed by passing an array of values to ss.norm.sf:

    import numpy as np
    import pandas as pd

    out = df.apply(
        lambda x: pd.Series(
            ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
        ), axis=1
    )

    The result, out, is:

         0         1         2         3
    0  1.0  0.934455  0.869533  0.805636
    1  1.0  0.875050  0.760469  0.656303

    You can then use join to add this to your original DataFrame, i.e. df.join(out):

        Id   a    b      c    0         1         2         3
    0  ijk   4  3.5  12.53  1.0  0.934455  0.869533  0.805636
    1  xyz  12  3.0  10.74  1.0  0.875050  0.760469  0.656303
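
If the remaining apply loop is also a concern, one possible alternative (not part of the answer above) is to let NumPy broadcasting evaluate every row and every offset in a single call to ss.norm.sf. The sketch below rebuilds the example DataFrame so that it runs on its own; the names offsets, ratios and result are illustrative.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Example data from the question.
df = pd.DataFrame({
    "Id": ["ijk", "xyz"],
    "a": [4, 12],
    "b": [3.5, 3.0],
    "c": [12.53, 10.74],
})

offsets = np.arange(4)  # column labels 0..3, and the amounts added to a

# Reshape each column to (n_rows, 1) so it broadcasts against offsets, shape (4,).
a = df["a"].to_numpy()[:, None]
b = df["b"].to_numpy()[:, None]
c = df["c"].to_numpy()[:, None]

# One call evaluates sf(a + k) for every row and every offset k.
sf = ss.norm.sf(a + offsets, loc=b, scale=c)

# Column 0 of sf is sf(a), the common denominator for each row.
ratios = sf / sf[:, [0]]

out = pd.DataFrame(ratios, index=df.index, columns=offsets)
result = df.join(out)
```

Here sf(a) is computed once per row and reused as the denominator, so no value is evaluated twice and no Python-level loop over rows is needed.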