python, pandas, row normalization

Doubts about how the pandas axis argument works; my code may be off


My issue is the following: I'm creating a pandas DataFrame from a dictionary that ends up with shape [70k, 300]. I'm trying to normalise each cell, either by columns first and then by rows, or the other way around (rows first, then columns).

I had asked a similar question before, but that was with a [70k, 70k] DataFrame (so, square), and there it worked just by doing this:

dfNegInfoClearRev = (df - df.mean(axis=1)) / df.std(axis=1).replace(0, 1)
dfNegInfoClearRev = (dfNegInfoClearRev - dfNegInfoClearRev.mean(axis=0)) / dfNegInfoClearRev.std(axis=0).replace(0, 1)
print(dfNegInfoClearRev)

This did what I needed for the [70k, 70k] case. A problem came up when I tried the same principle with a [70k, 300] frame. If I do this:


dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)
dfRINegInfoRows = (dfRINegInfo - dfRINegInfo.mean(axis=1)) / dfRINegInfo.std(axis=1).replace(0, 1)

I somehow end up with a [70k, 70k+300] DataFrame full of NaNs, with the extra columns carrying the same names as the rows.

I ended up doing this:

dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1), axis=0).fillna(1).replace(0, 1)

print(dfRINegInfoRows)

But I'm not sure this is what I was trying to do, and I don't really understand why, after the column normalisation (which does work and keeps the [70k, 300] shape), the row normalisation gives me a [70k, 70k+300]. Is the .sub/.div version actually doing what I intend? Any help?


Solution

  • I think your new code is doing what you want.

    If we look at a 3x3 toy example:

    df = pd.DataFrame([
        [1, 2, 3],
        [2, 4, 6],
        [3, 6, 9],
    ])
    

    The axis=1 mean is:

    df.mean(axis=1)
    
    # 0    2.0
    # 1    4.0
    # 2    6.0
    # dtype: float64
    

    And the subtraction applies to each row (i.e., [1,2,3] - [2,4,6] element-wise, [2,4,6] - [2,4,6], and [3,6,9] - [2,4,6]):

    df - df.mean(axis=1)
    
    #      0    1    2
    # 0 -1.0 -2.0 -3.0
    # 1  0.0  0.0  0.0
    # 2  1.0  2.0  3.0
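
    This only lines up in the square case because the index and the columns share the labels 0, 1, 2: the row-mean Series is matched against the column labels, so the default operator behaves like an explicit axis=1 subtraction. A quick equivalence check I've added, continuing with the same toy df:

    default_sub = df - df.mean(axis=1)

    # With a Series, the default arithmetic aligns its index with df's
    # columns, i.e. the value subtracted from column j is the mean of row j.
    explicit_sub = df.sub(df.mean(axis=1), axis=1)

    print(default_sub.equals(explicit_sub))

    # True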
    

    So if we have df2 shaped 3x2:

    df2 = pd.DataFrame([
        [1,2],
        [3,6],
        [5,10],
    ])
    

    The axis=1 mean is still length 3:

    df2.mean(axis=1)
    
    # 0    1.5
    # 1    4.5
    # 2    7.5
    # dtype: float64
    

    And subtraction will result in the 3rd column being nan (i.e., [1,2,nan] - [1.5,4.5,7.5] element-wise, [3,6,nan] - [1.5,4.5,7.5], and [5,10,nan] - [1.5,4.5,7.5]):

    df2 - df2.mean(axis=1)
    
    #      0    1   2
    # 0 -0.5 -2.5 NaN
    # 1  1.5  1.5 NaN
    # 2  3.5  5.5 NaN
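
    The extra column is pure label alignment: pandas takes the union of df2's column labels and the Series' index labels, and label 2 only exists on the Series side. A small check I've added, still using the df2 above:

    print(list(df2.columns))             # [0, 1]
    print(list(df2.mean(axis=1).index))  # [0, 1, 2]

    # The union of those labels becomes the result's columns, which is why
    # label 2 appears as an all-NaN column in df2 - df2.mean(axis=1).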
    

    If you make the subtraction itself along axis=0 then it works as expected:

    df2.sub(df2.mean(axis=1), axis=0)
    
    #      0    1
    # 0 -0.5  0.5
    # 1 -1.5  1.5
    # 2 -2.5  2.5
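
    Putting it together, here is a minimal sketch of the column-then-row normalisation on a non-square frame, mirroring your dfRI code; the random data and the variable names are just illustrative:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    dfRI = pd.DataFrame(rng.normal(size=(5, 3)), columns=["a", "b", "c"])

    # Column normalisation: the column means/stds have length 3 and align
    # with the column labels, so the default operators are fine here.
    dfCols = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)

    # Row normalisation: the row means/stds have length 5 (one per row),
    # so they have to be broadcast down the rows explicitly with axis=0.
    dfInter = dfCols.sub(dfCols.mean(axis=1), axis=0)
    dfRows = dfInter.div(dfInter.std(axis=1).replace(0, 1), axis=0)

    print(dfRows.shape)

    # (5, 3) -- shape preserved, no NaN columns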
    

    Coming back to your original attempt: when you use the default subtraction between a (70000, 300) DataFrame and its length-70000 Series of row means, the Series is aligned against the 300 column labels, and every row label that isn't also a column label becomes a new all-NaN column. That is how you end up with a [70k, 70k+300] frame full of NaNs.
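
    And a tiny reproduction of that blow-up, assuming (as your [70k, 70k+300] result suggests) that none of the column names appear in the row index:

    import pandas as pd

    df3 = pd.DataFrame([[1, 2], [3, 6], [5, 10]],
                       index=["r0", "r1", "r2"], columns=["x", "y"])

    out = df3 - df3.mean(axis=1)   # Series indexed "r0", "r1", "r2"

    print(out.shape)               # (3, 5): 2 column labels + 3 row labels
    print(out.isna().all().all())  # True -- nothing lines up, so everything is NaN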