My issue is the following: I'm creating a pandas data frame from a dictionary that ends up looking like [70k, 300]. I'm trying to normalise each cell, either by columns and then rows, or the other way around, rows then columns.
I had asked a similar question before, but that was with a [70k, 70k] data frame, so square, and it worked just by doing this:
dfNegInfoClearRev = (df - df.mean(axis=1)) / df.std(axis=1).replace(0, 1)
dfNegInfoClearRev = (dfNegInfoClearRev - dfNegInfoClearRev.mean(axis=0)) / dfNegInfoClearRev.std(axis=0).replace(0, 1)
print(dfNegInfoClearRev)
This did what I needed for the [70k, 70k] case. A problem came up when I tried the same principle with a [70k, 300]. If I do this:
dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)
dfRINegInfoRows = (dfRINegInfo - dfRINegInfo.mean(axis=1)) / dfRINegInfo.std(axis=1).replace(0, 1)
I somehow end up with a [70k, 70k+300] data frame full of NaNs, with the same names.
I ended up doing this:
dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1), axis=0).fillna(1).replace(0, 1)
print(dfRINegInfoRows)
But I'm not sure if this is what I was trying to do, and I don't really understand why, after the column normalisation (which does work and stays [70k, 300]), the row normalisation gives me a [70k, 70k+300]. I'm also not sure whether the way it's working is what I'm trying to do. Any help?
I think your new code is doing what you want.
If we look at a 3x3 toy example:
df = pd.DataFrame([
[1, 2, 3],
[2, 4, 6],
[3, 6, 9],
])
The axis=1 mean is:
df.mean(axis=1)
# 0 2.0
# 1 4.0
# 2 6.0
# dtype: float64
And the subtraction applies to each row (i.e., [1,2,3] - [2,4,6] element-wise, [2,4,6] - [2,4,6], and [3,6,9] - [2,4,6]):
df - df.mean(axis=1)
# 0 1 2
# 0 -1.0 -2.0 -3.0
# 1 0.0 0.0 0.0
# 2 1.0 2.0 3.0
So if we have df2 shaped 3x2:
df2 = pd.DataFrame([
[1,2],
[3,6],
[5,10],
])
The axis=1 mean is still length 3:
df2.mean(axis=1)
# 0 1.5
# 1 4.5
# 2 7.5
# dtype: float64
And subtraction will result in the 3rd column being nan (i.e., [1,2,nan] - [1.5,4.5,7.5] element-wise, [3,6,nan] - [1.5,4.5,7.5], and [5,10,nan] - [1.5,4.5,7.5]):
df2 - df2.mean(axis=1)
# 0 1 2
# 0 -0.5 -2.5 NaN
# 1 1.5 1.5 NaN
# 2 3.5 5.5 NaN
If you make the subtraction itself along axis=0, then it works as expected:
df2.sub(df2.mean(axis=1), axis=0)
# 0 1
# 0 -0.5 0.5
# 1 -1.5 1.5
# 2 -2.5 2.5
So when you use a default subtraction between a (70000, 300) frame and a length-70000 Series, pandas aligns the Series index against the column labels: the result has the union of the 300 column names and the 70000 row labels as columns, and since they don't match, you get a [70k, 70k+300] frame of nan.
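Putting it together, here is a minimal sketch of the column-then-row normalisation on a small rectangular frame, using the same axis=0 trick for the row step (variable names mirror the question; note it replaces zero standard deviations with 1 before dividing, a slight variation on the question's fillna approach):

```python
import pandas as pd

# Toy rectangular frame standing in for the (70000, 300) case
dfRI = pd.DataFrame([[1, 2], [3, 6], [5, 10]])

# Column normalisation: default alignment is fine here, since the
# axis=0 mean/std Series are indexed by the column labels.
dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)

# Row normalisation: force the broadcast along axis=0 so the
# length-3 row statistics line up with the index, not the columns.
dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1).replace(0, 1), axis=0)

print(dfRINegInfoRows.shape)  # stays (3, 2) -- no extra nan columns
```

The shape is preserved at every step, and no nan columns appear, because each row-wise statistic is explicitly broadcast down the index with axis=0 instead of being aligned against the column labels.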