python, pandas, row normalization

Doubts about how the pandas axis argument works; my code may be off


My issue is the following: I'm creating a pandas DataFrame from a dictionary that ends up with shape [70k, 300]. I'm trying to normalise each cell, either by columns first and then by rows, or the other way around (rows first, then columns).

I had asked a similar question before, but that was with a [70k, 70k] DataFrame (so, square), and there it worked just by doing this:

dfNegInfoClearRev = (df - df.mean(axis=1)) / df.std(axis=1).replace(0, 1)
dfNegInfoClearRev = (dfNegInfoClearRev - dfNegInfoClearRev.mean(axis=0)) / dfNegInfoClearRev.std(axis=0).replace(0, 1)
print(dfNegInfoClearRev)

This did what I needed for the [70k, 70k] case. A problem came up when I tried the same principle with a [70k, 300] frame. If I do this:


dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)
dfRINegInfoRows = (dfRINegInfo - dfRINegInfo.mean(axis=1)) / dfRINegInfo.std(axis=1).replace(0, 1)

I somehow end up with a [70k, 70k+300] DataFrame full of NaNs, with the extra columns carrying the same names as the rows.

I ended up doing this:

dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1), axis=0).fillna(1).replace(0, 1)

print(dfRINegInfoRows)

But I'm not sure this is what I was trying to do, and I don't really understand why, after the column normalisation (which does work and keeps the [70k, 300] shape), the row normalisation gives me a [70k, 70k+300]. Is the .sub/.div version actually doing what I intend? Any help?


Solution

  • I think your new code is doing what you want.

    If we look at a 3x3 toy example:

    df = pd.DataFrame([
        [1, 2, 3],
        [2, 4, 6],
        [3, 6, 9],
    ])
    

    The axis=1 mean is:

    df.mean(axis=1)
    
    # 0    2.0
    # 1    4.0
    # 2    6.0
    # dtype: float64
    

    And the subtraction applies to each row (i.e., [1,2,3] - [2,4,6] element-wise, [2,4,6] - [2,4,6], and [3,6,9] - [2,4,6]):

    df - df.mean(axis=1)
    
    #      0    1    2
    # 0 -1.0 -2.0 -3.0
    # 1  0.0  0.0  0.0
    # 2  1.0  2.0  3.0
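
    This only lines up in the square case because the index and the columns share the labels 0, 1, 2: the row-mean Series is matched against the column labels, so the default operator behaves like an explicit axis=1 subtraction. A quick equivalence check I've added, continuing with the same toy df:

    default_sub = df - df.mean(axis=1)

    # With a Series, the default arithmetic aligns its index with df's
    # columns, i.e. the value subtracted from column j is the mean of row j.
    explicit_sub = df.sub(df.mean(axis=1), axis=1)

    print(default_sub.equals(explicit_sub))

    # True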
    

    So if we have df2 shaped 3x2:

    df2 = pd.DataFrame([
        [1,2],
        [3,6],
        [5,10],
    ])
    

    The axis=1 mean is still length 3:

    df2.mean(axis=1)
    
    # 0    1.5
    # 1    4.5
    # 2    7.5
    # dtype: float64
    

    And subtraction will result in the 3rd column being nan (i.e., [1,2,nan] - [1.5,4.5,7.5] element-wise, [3,6,nan] - [1.5,4.5,7.5], and [5,10,nan] - [1.5,4.5,7.5]):

    df2 - df2.mean(axis=1)
    
    #      0    1   2
    # 0 -0.5 -2.5 NaN
    # 1  1.5  1.5 NaN
    # 2  3.5  5.5 NaN
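
    The extra column is pure label alignment: pandas takes the union of df2's column labels and the Series' index labels, and label 2 only exists on the Series side. A small check I've added, still using the df2 above:

    print(list(df2.columns))             # [0, 1]
    print(list(df2.mean(axis=1).index))  # [0, 1, 2]

    # The union of those labels becomes the result's columns, which is why
    # label 2 appears as an all-NaN column in df2 - df2.mean(axis=1).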
    

    If you make the subtraction itself along axis=0 then it works as expected:

    df2.sub(df2.mean(axis=1), axis=0)
    
    #      0    1
    # 0 -0.5  0.5
    # 1 -1.5  1.5
    # 2 -2.5  2.5
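
    Putting it together, here is a minimal sketch of the column-then-row normalisation on a non-square frame, mirroring your dfRI code; the random data and the variable names are just illustrative:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    dfRI = pd.DataFrame(rng.normal(size=(5, 3)), columns=["a", "b", "c"])

    # Column normalisation: the column means/stds have length 3 and align
    # with the column labels, so the default operators are fine here.
    dfCols = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)

    # Row normalisation: the row means/stds have length 5 (one per row),
    # so they have to be broadcast down the rows explicitly with axis=0.
    dfInter = dfCols.sub(dfCols.mean(axis=1), axis=0)
    dfRows = dfInter.div(dfInter.std(axis=1).replace(0, 1), axis=0)

    print(dfRows.shape)

    # (5, 3) -- shape preserved, no NaN columns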
    

    Coming back to your original attempt: when you use the default subtraction between a (70000, 300) DataFrame and its length-70000 Series of row means, the Series is aligned against the 300 column labels, and every row label that isn't also a column label becomes a new all-NaN column. That is how you end up with a [70k, 70k+300] frame full of NaNs.
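
    And a tiny reproduction of that blow-up, assuming (as your [70k, 70k+300] result suggests) that none of the column names appear in the row index:

    import pandas as pd

    df3 = pd.DataFrame([[1, 2], [3, 6], [5, 10]],
                       index=["r0", "r1", "r2"], columns=["x", "y"])

    out = df3 - df3.mean(axis=1)   # Series indexed "r0", "r1", "r2"

    print(out.shape)               # (3, 5): 2 column labels + 3 row labels
    print(out.isna().all().all())  # True -- nothing lines up, so everything is NaN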