python, pandas, replace, types, median

Replacing all 0s in a column in python dataframe with column's median value changes datatype to 'O'


I have a large pandas dataframe with 10000 rows and 33 columns. One of the columns is 'Age', which has dtype 'int64' and a considerable number of missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 33 columns):
customer                      10000 non-null int64
age                          10000 non-null int64

The missing values have been recorded as 0 in the data. Counting them:

 df['customer'][df[' age']==0].count()
 >2942

I am trying to replace all such 0s with the median value:

df[' age'].replace(to_replace=0, value = df[' age'].median, inplace = True)

This seems to run fine. But it changes the datatype of the column to O:

df[' age'].dtype
>dtype('O')

What is going wrong?


Solution

  • It is probably better to replace the missing data with NaNs, and then fill those NaN values with the median.

    Otherwise the zeros that stand in for missing data are counted when the median is computed, which pulls it down.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([0, 1, 2, 3], columns=['data'])
    # Mark the zeros in this column as missing
    df['data'] = df['data'].replace(0, np.nan)
    print(df)
    
       data
    0   NaN
    1   1.0
    2   2.0
    3   3.0
    
    df.fillna(df.median())
    
       data
    0   2.0
    1   1.0
    2   2.0
    3   3.0
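
As for why the dtype changed: the call in the question passes `df[' age'].median` without parentheses, so `value` is most likely the bound method itself rather than a number, and pandas has to store it in an object column. Below is a minimal sketch of the approach above applied to a single column. The toy data and the names `age` and `median_age` are made up for illustration, and the column is called `'age'` without the leading space.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 0 stands in for a missing age, as in the question.
df = pd.DataFrame({"age": [0, 25, 0, 40, 31]})

# Without parentheses, `median` is a bound method, not a number.
print(callable(df["age"].median))

# Mask the zeros as NaN so they are ignored by median(),
# then call median() (with parentheses) to get a scalar.
age = df["age"].replace(0, np.nan)
median_age = age.median()  # median of 25, 40 and 31

# Fill the gaps and cast back to an integer dtype.
df["age"] = age.fillna(median_age).astype("int64")
print(df["age"].tolist())
print(df["age"].dtype)
```

The final `astype("int64")` is optional. It truncates a fractional median, so skip it if you would rather keep the column as float.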