I am trying to use fillna in Python (pandas). It simply fails to replace the NaN values with anything, and it does not raise an error.
import pandas as pd
import io
import requests
import numpy as np
url='https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
df.columns=['Scn', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'CLASS']
df.to_csv("wisconsinbreast.csv")
m,n=df.shape
#print(m,n)
df = df.replace('?', np.nan)
#print(df)
#print(df.mean())
print(df.fillna(df.mean()))
In the output of the final print, NaN is still there. I have tried everything I could find by searching questions here, but it does not even give me any feedback on why it is failing. As I understand it, df.mean() should be computed while skipping the NaN values, but df.mean() does not return a value at all for the column that contains NaN.
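For reference, here is a minimal sketch of the symptom with made-up data (the column names are just borrowed from the real dataset): once a column is stuck at dtype object, mean() skips it, so fillna() has nothing to fill that column with. Newer pandas versions need numeric_only=True to get the silent skipping described above.

import numpy as np
import pandas as pd

# an A7-like column read in as strings because of the '?' sentinel, then '?' replaced by NaN
toy = pd.DataFrame({'A6': [2.0, 7.0, 3.0],
                    'A7': ['10', np.nan, '4']})   # A7 stays dtype object

print(toy.dtypes)                        # A7 is object, not numeric
means = toy.mean(numeric_only=True)      # only A6 gets a mean; A7 is skipped
print(means)
print(toy.fillna(means))                 # the NaN in A7 is left untouched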
na_values in read_csv

That '?' trips everything up. When read_csv sees it, it cannot parse the column as numeric, so the whole column falls back to dtype object and is read in as strings. Sure, you could fix this after the fact, but I suggest using the na_values argument to head it off at the beginning:
df = pd.read_csv(io.StringIO(s.decode('utf-8')), na_values=['?'])
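A quick sanity check, assuming you then assign the same column names as in the question (nothing here beyond the question's own imports):

df.columns = ['Scn', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'CLASS']
print(df.dtypes)              # A7 should now be float64 instead of object
print(df['A7'].isna().sum())  # the rows that had '?' now show up as NaN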
pd.to_numeric

But if you really want to fix it after the fact, do this instead of the replace:
df.A7 = pd.to_numeric(df.A7, errors='coerce')
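If you are not sure A7 is the only column affected, a variation on the same idea is to coerce every column at once; as far as I know every column in this dataset is meant to be numeric, so nothing legitimate gets lost:

df = df.apply(pd.to_numeric, errors='coerce')   # any leftover '?' anywhere becomes NaN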
In either case, the fillna should work as expected afterwards:
df.fillna(df.mean())
Scn A2 A3 A4 A5 A6 A7 A8 A9 A10 CLASS
0 1002945 5 4 4 5 7 10.000000 3 2 1 2
1 1015425 3 1 1 1 2 2.000000 3 1 1 2
2 1016277 6 8 8 1 3 4.000000 3 7 1 2
3 1017023 4 1 1 3 2 1.000000 3 1 1 2
4 1017122 8 10 10 8 7 10.000000 9 7 1 4
5 1018099 1 1 1 1 2 10.000000 3 1 1 2
6 1018561 2 1 2 1 2 1.000000 3 1 1 2
7 1033078 2 1 1 1 2 1.000000 1 1 5 2
8 1033078 4 2 1 1 2 1.000000 2 1 1 2
9 1035283 1 1 1 1 1 1.000000 3 1 1 2
10 1036172 2 1 1 1 2 1.000000 2 1 1 2
11 1041801 5 3 3 3 2 3.000000 4 4 1 4
12 1043999 1 1 1 1 2 3.000000 3 1 1 2
13 1044572 8 7 5 10 7 9.000000 5 5 4 4
14 1047630 7 4 6 4 6 1.000000 4 3 1 4
15 1048672 4 1 1 1 2 1.000000 2 1 1 2
16 1049815 4 1 1 1 2 1.000000 3 1 1 2
17 1050670 10 7 7 6 4 10.000000 4 1 2 4
18 1050718 6 1 1 1 2 1.000000 3 1 1 2
19 1054590 7 3 2 10 5 10.000000 5 4 4 4
20 1054593 10 5 5 3 6 7.000000 7 10 1 4
21 1056784 3 1 1 1 2 1.000000 2 1 1 2
22 1057013 8 4 5 1 2 3.548387 7 3 1 4
23 1059552 1 1 1 1 2 1.000000 3 1 1 2
24 1065726 5 2 3 4 2 7.000000 3 6 1 4
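Notice the 3.548387 at index 22: that is the column mean of A7 filled in where the '?' used to be. One last gotcha: fillna returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep it:

df = df.fillna(df.mean())
print(df.isna().sum())   # every column should now report 0 missing values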