Tags: python, pandas, missing-data, fillna

Filling null values with mean


I have a data set with many NaN values, and I want to fill each null value with the mean of its column. I tried the following code:

def fill_mean():  
    m = [df.columns.get_loc(c) for c in df.columns if c in missing]
    for i in m:
        df[df.columns[i]] =df[df.columns[i]].fillna(value=df[df.columns[i]].mean())
    return df

but I get this error:

TypeError: must be str, not int

Each column I'm trying to fill has a single dtype, either 'float64' or 'O'.
I suspect the problem stems from this, but how can I solve it?


Edit: I created a dictionary that maps each column with missing data to that column's dtype.

di = dict(zip(missing, m2)) 
def fill_mean():
    m = [df.columns.get_loc(c) for c in df.columns if c in missing]
    for i in m:
        if di[m] == "dtype('float64')":
            df[df.columns[i]] = df[df.columns[i]].fillna(value=df[df.columns[i]].mean())
    return df

If I run fill_mean() now, I get a different error:

    if di[m] == "dtype('float64')":

TypeError: unhashable type: 'list'

Solution

  • I think you want to first cast your columns to float, then call df.fillna with df.mean() as the value argument:

    df[["columns", "to", "change"]] = df[["columns", "to", "change"]].astype('float')
    
    df = df.fillna(df.mean())
    

    Note: If every column in your dataframe can be cast to float, you can simply do:

    df = df.astype('float').fillna(df.astype('float').mean())
    

    Example:

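    import numpy as np
    import pandas as pd

    # np.random.choice coerces the mixed list to strings, so missing values appear as 'nan'
    df = pd.DataFrame({'col1': np.random.choice([np.nan, '1', '2'], 10),
                       'col2': np.random.choice([np.nan, '1', '2'], 10)})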
    
    
    >>> print(df)
      col1 col2
    0    2    1
    1    2    1
    2  nan  nan
    3    1    2
    4    1    2
    5  nan    2
    6    2    2
    7    2    2
    8    1    2
    9  nan    1
    
    df[['col1', 'col2']] = df[['col1', 'col2']].astype('float')
    
    df = df.fillna(df.mean())
    
    
    >>> print(df)
           col1      col2
    0  2.000000  1.000000
    1  2.000000  1.000000
    2  1.571429  1.666667
    3  1.000000  2.000000
    4  1.000000  2.000000
    5  1.571429  2.000000
    6  2.000000  2.000000
    7  2.000000  2.000000
    8  1.000000  2.000000
    9  1.571429  1.000000
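
As for the second error in the question: `di[m]` indexes the dictionary with the whole list `m`, and a list is unhashable, which is what that TypeError reports. If you would rather keep a loop, here is a minimal sketch of a repaired `fill_mean`. It assumes `missing` is a list of column names, and the sample data (`price`, `label`) is made up for illustration. It tests each column's dtype directly with `pandas.api.types.is_numeric_dtype` rather than looking it up in a dictionary, and it leaves object columns untouched instead of casting them.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_mean(df, missing):
    """Return a copy of df where NaNs in the numeric columns named in
    `missing` are replaced by that column's mean."""
    out = df.copy()
    for col in missing:
        # Test the dtype directly instead of comparing it to a string.
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
    return out


# Hypothetical data: one float64 column and one object ('O') column.
df = pd.DataFrame({
    "price": [1.0, np.nan, 3.0],
    "label": ["a", None, "c"],
})
missing = ["price", "label"]  # assumed: names of columns with missing data

filled = fill_mean(df, missing)
print(filled)
```

Iterating over the column names avoids the round trip through `df.columns.get_loc`, and copying first keeps the original frame unchanged. If no per-column logic is needed, `df.fillna(df.mean(numeric_only=True))` should give the same result for the numeric columns in a single call.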