I have a large pandas DataFrame with 10000 rows and 33 columns. One of the columns is 'Age', which has dtype 'int64' and a considerable number of missing values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 33 columns):
customer 10000 non-null int64
age 10000 non-null int64
The missing values have been recorded as 0 in the data. Counting them:
df['customer'][df[' age']==0].count()
>2942
I am trying to replace all such 0s with the median value:
df[' age'].replace(to_replace=0, value = df[' age'].median, inplace = True)
This seems to run fine, but it changes the dtype of the column to object ('O'):
df[' age'].dtype
>dtype('O')
What is going wrong?
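A likely cause, sketched here under the assumption that the column holds plain integers: `df[' age'].median` without parentheses refers to the method itself rather than the number it returns, so `replace` is handed a Python object instead of a float. The small Series `age` below is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for the ' age' column; 0 marks a missing value.
age = pd.Series([0, 25, 0, 40, 31], name="age")

# Without parentheses this is a bound method, not a number.
print(callable(age.median))   # True

# With parentheses it is the computed median (zeros still included here).
median_value = age.median()
print(median_value)

# Passing the number rather than the method keeps the column numeric.
fixed = age.replace(to_replace=0, value=median_value)
print(fixed.dtype)
```

Calling the method, as in `df[' age'].median()`, should therefore avoid the object dtype, although the median is then still computed with the zeros included.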
It is probably better to replace the missing data with NaN first, and then fill those NaN values with the median.
Otherwise the zeros that stand in for missing values are counted as real data when the median is calculated, which can pull it downward.
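To see the effect on a small scale, the following sketch compares the median computed with the placeholder zeros included against the median computed after they are turned into NaN. The values are invented for illustration; it relies on pandas skipping NaN by default when computing a median.

```python
import numpy as np
import pandas as pd

# Invented ages; 0 stands in for a missing value, as in the question.
age = pd.Series([0, 0, 0, 20, 30, 40, 50])

# Zeros counted as real data drag the median down.
median_with_zeros = age.median()

# Zeros turned into NaN are skipped by median() (skipna=True by default).
median_without_zeros = age.replace(0, np.nan).median()

print(median_with_zeros, median_without_zeros)
```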
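import numpy as np
import pandas as pd
df = pd.DataFrame([0, 1, 2, 3], columns=['data'])
# Turn the zeros in this one column into NaN (other columns would be left alone)
df['data'] = df['data'].replace(0, np.nan)
print(df)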
   data
0   NaN
1   1.0
2   2.0
3   3.0
print(df.fillna(df.median()))  # fillna returns a new DataFrame; df itself is unchanged
   data
0   2.0
1   1.0
2   2.0
3   3.0
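Applied to a single column of a wider frame, the same idea might look like the sketch below. The frame, its values, and the column name `age` (without the leading space seen in the question) are assumptions for illustration. Because NaN forces the column to float, the last step rounds and casts back to `int64`; whether rounding is acceptable depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; 0 in 'age' marks a missing value.
df = pd.DataFrame({"customer": [1, 2, 3, 4, 5],
                   "age": [0, 22, 35, 0, 48]})

# Work on the one column only, so the other columns are left untouched.
age = df["age"].replace(0, np.nan)       # becomes float64 with NaN
age = age.fillna(age.median())           # median of the non-missing ages
df["age"] = age.round().astype("int64")  # back to integers

print(df)
```

Assigning the result back to the column, rather than using `inplace=True` on a selection, keeps it clear which object is being modified.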