I am trying to convert all missing data (marked with a '?') to NaN
and then use the imputation tool from sklearn
to replace each missing value with the column mean. To make my problem reproducible, I have included my code below. I use PyCharm as my IDE, on Mac OS X, with Anaconda on Python 2.7.12.
This is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, sep=',\s', na_values=["?"])
df.tail()
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr= imr.fit(df)
And here is my error message
/Users/zdong/anaconda/bin/python /Users/zdong/PycharmProjects/ml/crim_workingfile.py
/Users/zdong/PycharmProjects/ml/crim_workingfile.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, sep=',\s', na_values=["?"])
Traceback (most recent call last):
  File "/Users/zdong/PycharmProjects/535_final/535_workingfile.py", line 8, in <module>
    imr= imr.fit(df)
  File "/Users/zdong/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/imputation.py", line 156, in fit
    force_all_finite=False)
  File "/Users/zdong/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: invalid literal for float(): 6,?,?,Ontariocity,10,0.2,0.78,0.14,0.46,0.24,0.77,0.5,0.62,0.4,0.17,0.21,1,0.4,0.73,0.22,0.25,0.26,0.47,0.29,0.36,0.24,0.28,0.32,0.22,0.27,0.25,0.29,0.16,0.35,0.5,0.55,0.16,0.47,0.58,0.53,0.2,0.6,0.24
Please help me, a devastated beginner QAQ...
Okay, I think there's enough here for an actual answer. Looking at your data, the first 5 columns look like info about the cities (name, other values >= 1), and the rest look like the data you're interested in for the fit you do on the last line.
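A quick way to check this is to look at what pandas actually parsed. The sketch below uses two made-up rows shaped like the file (the values are illustrative, not the real data) and reads them twice: once with the sep=',\s' from the question and once with the default separator. The traceback shows an entire row as one string, which suggests the regex separator never matched, so comparing the two shapes is a cheap sanity check.

```python
import io

import pandas as pd

# Two made-up rows in the same shape as communities.data (illustrative only).
sample = (
    "6,?,?,Ontariocity,10,0.2,0.78\n"
    "8,?,?,Lakewoodcity,1,0.19,?\n"
)

# Separator from the question: a comma followed by one whitespace character.
as_asked = pd.read_csv(io.StringIO(sample), header=None, sep=r",\s",
                       engine="python", na_values=["?"])

# Default separator: a plain comma.
by_comma = pd.read_csv(io.StringIO(sample), header=None, na_values=["?"])

# If the regex never matches, every row ends up in a single column.
print(as_asked.shape)
print(by_comma.shape)

# Columns pandas could not parse as numbers: candidates to leave out of the fit.
non_numeric = by_comma.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real file, the same two checks (df.shape and the list of non-numeric columns) should tell you which columns are safe to pass to the imputer.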
Your issue is that the fit tries to cast all the data to a float, and obviously fails on the city names. The data passed into the fit should probably be everything except the first 5 columns (maybe 4, if column 5 is the bias?). Either way, try something like:
df = pd.read_csv('communities.data', header=None, na_values=["?"], usecols=range(5, 128))
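and change the 5 depending on which columns you need. Note that this also drops sep=',\s'. The ValueError shows a whole row as a single string, which suggests that separator never matched; the file looks plain comma-separated, so the default separator should split it correctly, and it also avoids the ParserWarning.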
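Putting the pieces together, here is a minimal sketch of the read-then-impute step. It assumes Python 3 and a recent scikit-learn, where sklearn.impute.SimpleImputer takes the place of the sklearn.preprocessing.Imputer used in the question. The inline rows are made up for illustration, and usecols=list(range(4, 7)) stands in for range(5, 128) on the real file.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows shaped like communities.data: 4 metadata columns, then numeric data.
sample = (
    "6,?,?,Ontariocity,0.2,0.78,?\n"
    "8,?,?,Lakewoodcity,?,0.12,0.4\n"
    "1,?,?,Examplecity,0.6,?,0.8\n"
)

# Keep only the numeric columns; '?' is turned into NaN while reading.
df = pd.read_csv(io.StringIO(sample), header=None, na_values=["?"],
                 usecols=list(range(4, 7)))

# Replace each NaN with the mean of its column.
imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = pd.DataFrame(imr.fit_transform(df), columns=df.columns)

print(imputed)
```

On an older scikit-learn like the one in the question, the Imputer(missing_values='NaN', strategy='mean', axis=0) call from the original code plays the same role once the frame contains only numeric columns.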