First of all, thanks in advance. I'm not sure whether I should open an issue, so I wanted to check whether anyone has faced this before.
I'm running into the following problem when using a CalibratedClassifierCV for text classification. My estimator is a pipeline created this way (simple example):
# Import libraries first
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
# Now create the estimators: pipeline -> calibratedclassifier(pipeline)
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
calibrated_pipeline = CalibratedClassifierCV(pipeline, cv=2)
Now we can create a simple train set to check if the classifier works:
# Create text and labels arrays
text_array = np.array(['Why', 'is', 'this', 'happening'])
outputs = np.array([0, 1, 0, 1])
When I try to fit the calibrated_pipeline object with calibrated_pipeline.fit(text_array, outputs), I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 4]
If you want I can copy the whole exception trace, but this should be easily reproducible. Thanks a lot in advance!
EDIT: I made a mistake when creating the arrays; it is fixed now (thanks, @ogrisel!). Also, calling:
pipeline.fit(text_array, outputs)
works properly, but doing so with the calibrated classifier fails!
np.array(['Why', 'is', 'this', 'happening']).reshape(-1,1)
is a 2D array of strings, whereas the docstring of the fit_transform method of the TfidfVectorizer class states that it expects:
Parameters
----------
raw_documents : iterable
    an iterable which yields either str, unicode or file objects
If you iterate over your 2D numpy array you get a sequence of 1D arrays of strings instead of strings directly:
>>> list(text_array)
[array(['Why'], dtype='<U9'), array(['is'], dtype='<U9'),
 array(['this'], dtype='<U9'), array(['happening'], dtype='<U9')]
So the fix is easy. Just pass a flat sequence of strings such as:
text_documents = ['Why', 'is', 'this', 'happening']
as the raw input to the vectorizer.
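If the documents already live in a 2D column array, as in the reshaped array above, a minimal sketch of the same fix is to flatten it before vectorizing. The variable names below (text_array_2d, first_item, n_docs) are only illustrative and not part of the original code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 2D column of strings, shaped like the reshaped array discussed above
text_array_2d = np.array(['Why', 'is', 'this', 'happening']).reshape(-1, 1)

# Iterating over a 2D array yields 1D arrays, not plain strings
first_item = next(iter(text_array_2d))
is_str_2d = isinstance(first_item, str)

# Flatten to a plain list so that iteration yields one string per document
text_documents = text_array_2d.ravel().tolist()
is_str_flat = all(isinstance(doc, str) for doc in text_documents)

# The vectorizer now receives one string per document
tfidf = TfidfVectorizer().fit_transform(text_documents)
n_docs = tfidf.shape[0]

print(is_str_2d, is_str_flat, n_docs)
```

Here ravel() followed by tolist() turns the column into an ordinary list of Python strings, which matches what the docstring asks for.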
Edit (remark): LogisticRegression is almost always a well-calibrated classifier by default, so CalibratedClassifierCV will likely not bring anything in this case.
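One way to check this remark on your own data is to compare the Brier score (lower is better) of a plain LogisticRegression with that of a calibrated one. The sketch below uses synthetic data from make_classification purely as a stand-in, which is an assumption for illustration; the numbers it prints say nothing about the text data in the question:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the data from the question)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain logistic regression
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The same model wrapped in a calibrator
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)
calibrated.fit(X_train, y_train)

# Brier score of the predicted probability for the positive class
brier_plain = brier_score_loss(y_test, plain.predict_proba(X_test)[:, 1])
brier_calibrated = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

print(f"plain: {brier_plain:.4f}  calibrated: {brier_calibrated:.4f}")
```

If the two scores come out close, the extra calibration step is probably not buying much for that dataset.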