machine-learning  missing-data  imputation

Error when turning question marks ('?') into NaN while imputing a machine learning dataset


I am trying to convert all missing data (indicated by a '?') to NaN and then use the imputation tool from sklearn to replace each one with the column mean. To make my problem reproducible, I have included my code below. I use PyCharm as my IDE on Mac OS X, with Anaconda and Python 2.7.12.

This is my code:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, sep=',\s', na_values=["?"])
df.tail()
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr= imr.fit(df)

And here is my error message:

/Users/zdong/anaconda/bin/python /Users/zdong/PycharmProjects/ml/crim_workingfile.py
/Users/zdong/PycharmProjects/ml/crim_workingfile.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, sep=',\s', na_values=["?"])
Traceback (most recent call last):
  File "/Users/zdong/PycharmProjects/535_final/535_workingfile.py", line 8, in <module>
    imr= imr.fit(df)
  File "/Users/zdong/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/imputation.py", line 156, in fit
    force_all_finite=False)
  File "/Users/zdong/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: invalid literal for float(): 6,?,?,Ontariocity,10,0.2,0.78,0.14,0.46,0.24,0.77,0.5,0.62,0.4,0.17,0.21,1,0.4,0.73,0.22,0.25,0.26,0.47,0.29,0.36,0.24,0.28,0.32,0.22,0.27,0.25,0.29,0.16,0.35,0.5,0.55,0.16,0.47,0.58,0.53,0.2,0.6,0.24

Please help me, a devastated beginner QAQ...


Solution

  • Okay I think there's enough here for an actual answer. Looking at your data, the first 5 columns look like info about the cities (name, other values >= 1), and the rest look like the data you're interested in for the fit you do on the last line.

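    Your issue is that the fit tries to cast all the data to float, which fails on the city names. Note also that the value in your error is an entire row: sep=',\s' only matches a comma followed by whitespace, and the row shown has no spaces after its commas, so it was never split into columns. Dropping sep (the default is a plain comma) takes care of that. The data passed into the fit should probably be everything except the first 5 columns (maybe 4, if column 5 is the bias?). Either way, try something like: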

    df = pd.read_csv('communities.data', header=None, na_values=["?"], usecols=range(5, 128))
    

    and change the 5 depending on which columns you need.
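
To see both effects without downloading the file, here is a minimal sketch on three made-up rows laid out like the row in the error message (5 identifying columns, then numeric data). The sample values, the names SAMPLE, unsplit and imputed, and the shortened range(5, 8) are assumptions for illustration. The mean imputation is done with pandas' fillna rather than the question's Imputer, so treat it as a stand-in for that step, not the same API.

```python
import io

import pandas as pd

# Three made-up rows in the same layout as the row from the error message:
# 5 identifying columns followed by numeric data, with '?' for missing values.
SAMPLE = (
    "6,?,?,Ontariocity,10,0.2,0.78,?\n"
    "8,?,?,Examplecity,1,0.4,?,0.5\n"
    "53,?,?,Samplecity,1,0.6,0.22,0.1\n"
)

# The question's separator: a comma followed by whitespace. No such pair
# occurs in the sample, so each row stays in a single string column.
unsplit = pd.read_csv(io.StringIO(SAMPLE), header=None, sep=r",\s",
                      engine="python", na_values=["?"])

# Default separator (a plain comma), skipping the 5 identifying columns.
# For the real file this would be range(5, 128), as in the answer.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, na_values=["?"],
                 usecols=range(5, 8))

# Column-wise mean imputation, done with pandas instead of Imputer.
imputed = df.fillna(df.mean())

print(unsplit.shape)   # one column per row: nothing was split
print(df.dtypes)       # the remaining columns are all numeric
print(imputed)
```

Once the frame holds only numeric columns like df here, the cast to float that failed in the traceback has nothing left to choke on.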