I am trying to train a binary classifier on a very large dataset. Previously, I could train it with scikit-learn's fit method, but now I have more data than I can handle at once. I am trying to fit it in chunks, but I keep running into errors. How can I train on this data incrementally? With my previous approach, I get an error about the Pipeline object. I have gone through the examples from Incremental Learning, but running those code samples still gives an error. I would appreciate any help.
X,y = transform_to_dataset(training_data)
clf = Pipeline([
('vectorizer', DictVectorizer()),
('classifier', LogisticRegression())])
length = len(X) // 2
clf.partial_fit(X[:length],y[:length],classes=np.array([0,1]))
clf.partial_fit(X[length:],y[length:],classes=np.array([0,1]))
ERROR
AttributeError: 'Pipeline' object has no attribute 'partial_fit'
TRYING GIVEN CODE SAMPLES:
clf=SGDClassifier(alpha=.0001, loss='log', penalty='l2', n_jobs=-1,
#shuffle=True, n_iter=10,
verbose=1)
length = len(X) // 2
clf.partial_fit(X[:length],y[:length],classes=np.array([0,1]))
clf.partial_fit(X[length:],y[length:],classes=np.array([0,1]))
ERROR
File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 573, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number
My dataset consists of some sentences with their part of speech tags and dependency relations.
Thanks NN 0 root
to IN 3 case
all DT 1 nmod
who WP 5 nsubj
volunteered VBD 3 acl:relcl
. . 1 punct
You PRP 3 nsubj
will MD 3 aux
remain VB 0 root
as IN 5 case
alternates NNS 3 obl
. . 3 punct
A Pipeline object from scikit-learn does not have a partial_fit method, as seen in the docs.
The reason for this is that you can add any estimator you want to that Pipeline object, and not all of them implement partial_fit. Here is a list of the supported estimators.
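If you want to check this yourself rather than rely on the list, a quick sketch is to test for the attribute directly. The three estimators below are just the ones from the question; the dictionary names are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import Pipeline

# The objects from the question, keyed by a label chosen for this example.
candidates = {
    "Pipeline": Pipeline([
        ("vectorizer", DictVectorizer()),
        ("classifier", LogisticRegression()),
    ]),
    "LogisticRegression": LogisticRegression(),
    "SGDClassifier": SGDClassifier(),
}

# An estimator supports incremental learning only if it exposes partial_fit.
supports_partial_fit = {
    name: hasattr(estimator, "partial_fit")
    for name, estimator in candidates.items()
}

for name, supported in supports_partial_fit.items():
    print(name, "supports partial_fit:", supported)
```

Note that LogisticRegression itself has no partial_fit either, so swapping the Pipeline for the bare classifier would not be enough.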
As you can see, using SGDClassifier (without Pipeline), you don't get this "no attribute" error, because this specific estimator is supported. The error you get for this one is probably due to text data: without the vectorizer step, X still contains non-numeric values, and SGDClassifier expects a numeric array. You can use LabelEncoder to process the non-numeric columns.