
Is it necessary to encode labels when using `TfidfVectorizer`, `CountVectorizer`, etc.?


When working with text data, I understand the need to encode text labels into some numeric representation (e.g., by using `LabelEncoder`, `OneHotEncoder`, etc.).

However, my question is whether you need to perform this step explicitly when using a feature extraction class (e.g. `TfidfVectorizer`, `CountVectorizer`, etc.), or whether these encode the labels under the hood for you.

If you do need to encode the labels separately yourself, can you perform this step within a `Pipeline` (such as the one below)?

    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier())
    ])

Or do you need to encode the labels beforehand, since the pipeline expects to `fit()` and `transform()` the data (not the labels)?


Solution

  • Have a look at the scikit-learn glossary entry for the term transform:

    In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.

    In fact, almost all transformers only transform the features. This holds true for TfidfVectorizer and CountVectorizer as well. If ever in doubt, you can always check the return type of the transforming function (like the fit_transform method of CountVectorizer).
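    As a quick sanity check (the toy documents here are made up for illustration), you can confirm that `fit_transform` only consumes and returns features, never labels:

    ```python
    from scipy.sparse import issparse
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog barked"]

    # fit_transform takes only the documents (X); no y is involved
    X = CountVectorizer().fit_transform(docs)

    # The result is a sparse document-term matrix:
    # one row per sample, one column per vocabulary term
    print(type(X), X.shape)
    ```

    The output is a sparse matrix of shape `(n_samples, n_features)`, i.e. purely a transformation of `X`.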

    The same goes when you assemble several transformers in a pipeline, as stated in its user guide:

    Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).

    So, in conclusion: you typically handle the labels separately, before you fit the estimator/pipeline.
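    A minimal sketch of that workflow, using the question's pipeline (the example texts and labels are invented; note that scikit-learn classifiers also accept string labels directly, so this explicit encoding step is often optional):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder

    X = ["free money now", "meeting at noon", "win cash prizes", "lunch tomorrow"]
    y = ["spam", "ham", "spam", "ham"]

    # Encode the labels yourself, outside the pipeline
    le = LabelEncoder()
    y_enc = le.fit_transform(y)  # classes_ sorted alphabetically: ham -> 0, spam -> 1

    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier(random_state=0))
    ])

    # The pipeline transforms X only; y_enc passes through to the classifier untouched
    pipeline.fit(X, y_enc)

    # Map numeric predictions back to the original string labels
    pred = le.inverse_transform(pipeline.predict(["cash prizes today"]))
    ```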