Tags: python, machine-learning, scikit-learn, dictvectorizer

AttributeError: 'Pipeline' object has no attribute 'partial_fit'


I am trying to train a binary classifier on a very large dataset. Previously, I could train it with sklearn's fit method, but now I have more data than I can handle in a single pass. I am trying to fit the data in parts, but I keep running into errors. How can I train on a huge dataset incrementally? With my previous approach, I get an error about the Pipeline object. I have gone through the Incremental Learning examples, but running those code samples also raises an error. I would appreciate any help.

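import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = transform_to_dataset(training_data)

clf = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('classifier', LogisticRegression())])

# Integer division, so that length can be used as a slice index
length = len(X) // 2

clf.partial_fit(X[:length], y[:length], classes=np.array([0, 1]))

clf.partial_fit(X[length:], y[length:], classes=np.array([0, 1]))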

ERROR

AttributeError: 'Pipeline' object has no attribute 'partial_fit'

TRYING GIVEN CODE SAMPLES:

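from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(alpha=.0001, loss='log', penalty='l2', n_jobs=-1,
                    #shuffle=True, n_iter=10,
                    verbose=1)
# Integer division, so that length can be used as a slice index
length = len(X) // 2

clf.partial_fit(X[:length], y[:length], classes=np.array([0, 1]))

clf.partial_fit(X[length:], y[length:], classes=np.array([0, 1]))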

ERROR

File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 573, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

My dataset consists of sentences annotated with part-of-speech tags and dependency relations.

Thanks  NN  0   root
to  IN  3   case
all DT  1   nmod
who WP  5   nsubj
volunteered VBD 3   acl:relcl
.   .   1   punct

You PRP 3   nsubj
will    MD  3   aux
remain  VB  0   root
as  IN  5   case
alternates  NNS 3   obl
.   .   3   punct

Solution

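  • A scikit-learn Pipeline object does not have a partial_fit method, as the docs show.

    The reason is that any estimator can be added to a Pipeline, and not all of them implement partial_fit. The scikit-learn documentation on incremental learning lists the estimators that do.

    As you can see, using SGDClassifier without a Pipeline avoids the "no attribute" error, because that estimator supports partial_fit. The TypeError you get in that case is probably because X still contains non-numeric data: without the DictVectorizer step, the raw features reach the classifier unconverted. Convert them to numbers before calling partial_fit. LabelEncoder can encode non-numeric values, although scikit-learn documents it for target labels; for feature dicts, a vectorizer such as DictVectorizer (fitted beforehand) or the stateless FeatureHasher is the more usual choice.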
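To make the last point concrete, here is a minimal sketch of incremental training without a Pipeline. It assumes X is a list of feature dicts like the ones DictVectorizer would consume. The toy data, the to_indicator_features and batches helpers, and the batch size are all made up for illustration, and FeatureHasher is swapped in for DictVectorizer because it is stateless and can transform each batch on its own. The loss argument is left at its default here because the name of the logistic loss differs between scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data: one feature dict per token, with a made-up binary label.
X = [
    {"word": "Thanks", "pos": "NN", "dep": "root"},
    {"word": "to", "pos": "IN", "dep": "case"},
    {"word": "all", "pos": "DT", "dep": "nmod"},
    {"word": "who", "pos": "WP", "dep": "nsubj"},
    {"word": "volunteered", "pos": "VBD", "dep": "acl:relcl"},
    {"word": ".", "pos": ".", "dep": "punct"},
]
y = [1, 0, 0, 0, 0, 0]


def to_indicator_features(sample):
    """Turn {'pos': 'NN'} into {'pos=NN': 1} so every value is numeric."""
    return {"%s=%s" % (name, value): 1 for name, value in sample.items()}


def batches(features, labels, size):
    """Yield consecutive (features, labels) chunks of at most `size` items."""
    for start in range(0, len(features), size):
        yield features[start:start + size], labels[start:start + size]


# FeatureHasher needs no fitting, so each batch is transformed independently.
hasher = FeatureHasher(n_features=2 ** 10, input_type="dict")
clf = SGDClassifier(alpha=1e-4, penalty="l2", random_state=0)
classes = np.array([0, 1])  # all labels must be declared up front

for X_batch, y_batch in batches(X, y, size=3):
    X_numeric = hasher.transform(to_indicator_features(s) for s in X_batch)
    clf.partial_fit(X_numeric, y_batch, classes=classes)

predictions = clf.predict(hasher.transform(to_indicator_features(s) for s in X))
```

With real data, the batches would typically be read from disk one at a time instead of being sliced from a list already in memory. If you need the feature names later, a DictVectorizer fitted once on the full set of feature dicts (or on a known vocabulary) is an alternative to hashing, at the cost of one extra pass over the data.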