Search code examples
pythonpandasdataframelambdaapply

Replacing values in pands df (not Nan) with the column sum except first two cols


Update: I mixed up the axis. Thats why my question seems weired. This is the new input data:

give following data:

import pandas as pd

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
    'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
    'B': [1, 1, 1, 1, 'NaN', 1],
    'C': [1, 1, 1, 1, 1, 1],
    'D': ['NaN', 'NaN', 1, 'NaN', 1, 'NaN'],
    'E': [1, 1, 1, 1, 'NaN', 1],
    'F': ['NaN', 1, 1, 1, 'NaN', 1]}

df = pd.DataFrame(data)

I want to sum the columns except the first two and then replace the values not NaN with the sum the column results:

the result should like this:

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
    'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
    'B': [5, 5, 5, 5, 'NaN', 5],
    'C': [6, 6, 6, 6, 6, 6],
    'D': ['NaN', 'NaN', 2, 'NaN', 2, 'NaN'],
    'E': [5, 5, 5, 5, 'Nan', 5],
    'F': ['NaN',4, 4, 4, 'NaN', 4]}

df = pd.DataFrame(data)

I tried:

column_sums = df.iloc[:, 2:].sum()
for column in iloc[:, 2:].columns:
    df[column] = column_sums[column]

But that replaces me all values.

Is there a smooth solution for that possible?

Thanks


Solution

  • You can consider broadcasting a multiplication in numpy.

    Data

    import pandas as pd
    import numpy as np
    
    data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
        'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
        'B': [1, 1, 1, 1, 'NaN', 1],
        'C': [1, 1, 1, 1, 1, 1],
        'D': ['NaN', 'NaN', 1, 'NaN', 1, 'NaN'],
        'E': [1, 1, 1, 1, 'NaN', 1],
        'F': ['NaN', 1, 1, 1, 'NaN', 1]}
    
    df = pd.DataFrame(data)
    df = df.replace('NaN', np.nan)
    

    Mask nonnull values

    Here we want NaN values to be NaN that's why we replace 0 with np.nan

    m = df[df.columns[2:]].notnull()\
        .astype(float).replace(0, np.nan)
    print(m)
    
         B    C    D    E    F
    0  1.0  1.0  NaN  1.0  NaN
    1  1.0  1.0  NaN  1.0  1.0
    2  1.0  1.0  1.0  1.0  1.0
    3  1.0  1.0  NaN  1.0  1.0
    4  NaN  1.0  1.0  NaN  NaN
    5  1.0  1.0  NaN  1.0  1.0
    

    Sum for all columns but first 2

    s = df[df.columns[2:]].sum()
    print(s)
    
    B    5.0
    C    6.0
    D    2.0
    E    5.0
    F    4.0
    dtype: float64
    

    Multiply

    df[df.columns[2:]] = m * s
    
    print(df)
    
         Org    A    B    C    D    E    F
    0    Tom  NaN  5.0  6.0  NaN  5.0  NaN
    1  Kelly  1.0  5.0  6.0  NaN  5.0  4.0
    2   Rick  1.0  5.0  6.0  2.0  5.0  4.0
    3   Dave  1.0  5.0  6.0  NaN  5.0  4.0
    4   Sara  NaN  NaN  6.0  2.0  NaN  NaN
    5    Liz  NaN  5.0  6.0  NaN  5.0  4.0