python scikit-learn pipeline calibration

Bug with CalibratedClassifierCV when using a Pipeline with TF-IDF?

Thanks in advance. I'm not sure whether this warrants opening an issue, so I wanted to check first whether anyone has run into it before.

I'm running into a problem when using CalibratedClassifierCV for text classification. My estimator is a pipeline, created like this (simplified example):

# Import libraries first
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Now create the estimators: pipeline -> CalibratedClassifierCV(pipeline)
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
calibrated_pipeline = CalibratedClassifierCV(pipeline, cv=2)

Now we can create a simple training set to check whether the classifier works:

# Create text and labels arrays
text_array = np.array(['Why', 'is', 'this', 'happening'])
outputs = np.array([0, 1, 0, 1])

When I try to fit the calibrated_pipeline object, I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 4]

I can post the full traceback if needed, but this should be easy to reproduce. Thanks a lot in advance!

EDIT: I made a mistake when creating the arrays; it's fixed now (thanks, @ogrisel!). Also, calling:

pipeline.fit(text_array, outputs)

works properly, but doing so with the calibrated classifier fails!

Solution

np.array(['Why', 'is', 'this', 'happening']).reshape(-1, 1) is a 2D array of strings, whereas the docstring of the fit_transform method of the TfidfVectorizer class states that it expects:

    Parameters
    ----------
    raw_documents : iterable
        an iterable which yields either str, unicode or file objects

If you iterate over your 2D numpy array you get a sequence of 1D arrays of strings instead of strings directly:

>>> text_array = np.array(['Why', 'is', 'this', 'happening']).reshape(-1, 1)
>>> list(text_array)
[array(['Why'], dtype='<U9'),
 array(['is'], dtype='<U9'),
 array(['this'], dtype='<U9'),
 array(['happening'], dtype='<U9')]

So the fix is easy: just pass a flat sequence of strings, such as

text_documents = ['Why', 'is', 'this', 'happening']

as the raw input to the vectorizer.
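
If the documents already live in a 2D column array, one option is to flatten it back into a plain list before vectorizing. This is a minimal sketch on the same four-word example; using ravel().tolist() is an assumption about how you might do the conversion, not something the vectorizer requires:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 2D column of strings, shape (4, 1): iterating over it yields 1D arrays
text_array = np.array(['Why', 'is', 'this', 'happening']).reshape(-1, 1)

# Flatten to a plain list of Python str, which is what the vectorizer expects
text_documents = text_array.ravel().tolist()

# Each document now becomes one row of the TF-IDF matrix
X = TfidfVectorizer().fit_transform(text_documents)
print(text_array.shape, X.shape[0])
```

The resulting matrix has one row per document, so its number of rows matches the number of labels.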

Edit (remark): LogisticRegression usually produces well-calibrated probabilities by default, because it directly optimizes log loss. CalibratedClassifierCV is therefore unlikely to bring any improvement in this case.
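
One way to check this on your own data is to compare the Brier score (lower is better) of the plain model against the calibrated one on a held-out set. This is a rough sketch on synthetic data from make_classification, which is an assumption standing in for a real text dataset; the sizes and parameters are arbitrary:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary problem (placeholder for real data)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain logistic regression vs. the same model wrapped in a calibrator
raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=5)
calibrated.fit(X_train, y_train)

# Brier score of the predicted probability for the positive class
brier_raw = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
brier_cal = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
print(f"raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")
```

If the two scores come out close, the extra calibration step is probably not buying much for that model.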