python pandas imputation

How to fill missing values in a dataframe depending on conditions?


I want to replace the null values in each column with that column's mean, df[col].mean(), unless the column is entirely null. In that case I want to fill the column with 0.0.

I wrote the following code:

if train_x[cols].isna().sum() == len(train_x): # need to fix
    train_x.loc[:, cols] = train_x[cols].fillna(value=0.0)
else:
    train_x.loc[:, cols] = train_x[cols].fillna(value=train_x[cols].mean())

This code raises an error because train_x[cols] is a DataFrame, so the condition evaluates to a boolean Series instead of a single True/False value. I need the condition to be applied to each column separately.

Is there a better way to do this?

Sorry for my poor English skills.


Solution

  • With the following toy dataframe:

    import pandas as pd
    
    df = pd.DataFrame(
        {"col1": [1, 9, pd.NA], "col2": [pd.NA, pd.NA, pd.NA], "col3": [8, 4, 3]}
    )
    
    print(df)
    # Output
       col1  col2  col3
    0     1  <NA>     8
    1     9  <NA>     4
    2  <NA>  <NA>     3
    

    Here is one way to do it:

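    for col in df.columns:
        if df[col].isna().all():
            # Every value is missing: there is no mean, so fill with 0
            df[col] = 0
        else:
            # Replace missing values with the mean of the non-missing ones
            df[col] = df[col].fillna(df[col].mean())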
    

    Then:

    print(df)
    # Output
       col1  col2  col3
    0   1.0     0     8
    1   9.0     0     4
    2   5.0     0     3
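
As a possible alternative to the loop, the same rule can be sketched with two chained fillna calls. This is only a sketch, and it makes one assumption that differs from the toy dataframe above: the columns hold floats with np.nan as the missing marker rather than pd.NA in object columns. The names col_means and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: numeric columns that use np.nan for missing values
df = pd.DataFrame(
    {
        "col1": [1.0, 9.0, np.nan],
        "col2": [np.nan, np.nan, np.nan],
        "col3": [8.0, 4.0, 3.0],
    }
)

# mean() skips NaN by default; a column that is entirely NaN gets a NaN mean
col_means = df.mean()

# The first fillna uses each column's own mean. Columns whose mean is NaN
# are left unchanged, so the second fillna sets them to 0.
filled = df.fillna(col_means).fillna(0.0)

print(filled)
```

Unlike the loop, this returns a new dataframe and leaves df untouched. An entirely null column ends up as 0.0 (float) rather than the integer 0.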