python scikit-learn nlp text-classification

Sklearn Pipeline ValueError: could not convert string to float

I'm playing around with sklearn and NLP for the first time, and thought I understood everything I was doing up until I didn't know how to fix this error. Here is the relevant code (largely adapted from http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html):

import os

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from xgboost import XGBClassifier
from pandas import DataFrame

def read_files(path):
    for article in os.listdir(path):
        with open(os.path.join(path, article)) as f:
            text = f.read()
        yield os.path.join(path, article), text

def build_data_frame(path, classification):
    rows = []
    index = []
    for filename, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(filename)
    df = DataFrame(rows, index=index)
    return df

data = DataFrame({'text': [], 'class': []})
for path, classification in SOURCES: # SOURCES is a list of tuples
    data = data.append(build_data_frame(path, classification))
data = data.reindex(np.random.permutation(data.index))

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('svd', TruncatedSVD(algorithm='randomized', n_components=300)),
            ])),
        ('words', Pipeline([('wscaler', StandardScaler())])),
    ])),
    ('clf', XGBClassifier(silent=False)),
])
classifier.fit(data['text'].values, data['class'].values)

The data loaded into the DataFrame is preprocessed text with all stopwords, punctuation, unicode, capitals, etc. taken care of. This is the error I get when I call fit on the classifier, where the ... represents one of the documents that should have been vectorized in the pipeline:

ValueError: could not convert string to float: ...

I first thought the TfidfVectorizer() was not working and was causing an error in the SVD step, but after I extracted each step from the pipeline and ran them sequentially, the same error came up only on XGBClassifier.fit().

Even more confusing to me, I tried to take this script apart step by step in the interpreter, but when I tried to import either read_files or build_data_frame, the same ValueError came up with one of my strings, even though all I had run was:

from classifier import read_files

I have no idea how that could be happening. If anyone has any idea what my glaring errors may be, I'd really appreciate it. I'm trying to wrap my head around these concepts on my own, but coming across a problem like this leaves me feeling pretty incapacitated.

Solution

The first part of your pipeline is a FeatureUnion. A FeatureUnion passes all the data it receives to each of its internal parts in parallel. The second part of your FeatureUnion is a Pipeline containing a single StandardScaler. That's the source of the error.

This is your data flow:

X --> classifier, Pipeline
            |
            |  <== X is passed to FeatureUnion
            \/
      features, FeatureUnion
                      |
                      |  <== X is duplicated and passed to both parts
        ______________|__________________
       |                                 |
       |  <===   X contains text  ===>   |                         
       \/                               \/
   text, Pipeline                   words, Pipeline
           |                                  |   
           |  <===    Text is passed  ===>    |
          \/                                 \/ 
       tfidf, TfidfVectorizer            wscaler, StandardScaler  <== Error
                 |                                   |
                 | <==Text converted to floats       |
                \/                                   |
              svd, TruncatedSVD                      |
                       |                             |
                       |                             |
                      \/____________________________\/
                                      |
                                      |
                                     \/
                                   clf, XGBClassifier
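
The fan-out in the diagram can be checked on toy data. This is only a minimal sketch: the two FunctionTransformer branches and their names ("identity", "doubled") are made up for illustration and are not part of your pipeline. Both branches receive the same X, and the FeatureUnion stacks their outputs side by side.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy numeric input: three samples, one feature.
X = np.array([[1.0], [2.0], [3.0]])

# Two hypothetical branches; each one is handed the same X.
union = FeatureUnion([
    ("identity", FunctionTransformer(lambda a: a)),
    ("doubled", FunctionTransformer(lambda a: a * 2)),
])

# The outputs of the branches are concatenated column-wise.
combined = union.fit_transform(X)
print(combined.shape)
```

The result has one column per branch, which is why every branch has to be able to handle the raw input on its own.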

Because raw text is passed to StandardScaler, the error is thrown: StandardScaler can only work with numerical features.
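
You can reproduce this in isolation. The sketch below assumes a recent scikit-learn and uses two made-up documents in place of your data; fitting a StandardScaler directly on strings should fail with a ValueError like the one you are seeing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two placeholder documents, shaped as (n_samples, 1) like a single text column.
docs = np.array([["some preprocessed text"], ["another document"]], dtype=object)

error_message = None
try:
    # StandardScaler tries to convert its input to floats before scaling.
    StandardScaler().fit(docs)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```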

Just as you are converting text to numbers using TfidfVectorizer, before sending that to TruncatedSVD, you need to do the same before StandardScaler, or else only provide numerical features to it.
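
One way to do that, as a rough sketch: put a step in front of the scaler that turns each document into numbers. The text_stats helper below (character and word counts) is hypothetical and only stands in for whatever numeric features you actually want in the "words" branch; the toy documents and n_components=2 are also just for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def text_stats(docs):
    """Hypothetical numeric features: character count and word count per document."""
    return np.array([[len(d), len(d.split())] for d in docs], dtype=float)


# Toy corpus standing in for data['text'].values.
docs = np.array([
    "cheap pills buy now",
    "meeting moved to friday afternoon",
    "buy cheap watches now",
    "lunch on friday with the team",
])

features = FeatureUnion([
    ("text", Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ])),
    ("words", Pipeline([
        # Text is converted to numbers before it reaches the scaler.
        ("stats", FunctionTransformer(text_stats)),
        ("wscaler", StandardScaler()),
    ])),
])

X_features = features.fit_transform(docs)
print(X_features.shape)
```

The combined matrix has two SVD columns plus two scaled count columns, and a 'clf' step can then follow the union as in your original pipeline.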

Looking at the description in the question, did you intend to apply StandardScaler to the output of TruncatedSVD?
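
If so, the scaler belongs inside the text pipeline, after the SVD step, and the FeatureUnion is not needed. Here is a minimal sketch of that layout. Note the assumptions: the documents and labels are toy data, n_components is shrunk to fit them, and LogisticRegression is used purely as a stand-in so the example runs without xgboost; you would put your XGBClassifier in the 'clf' step instead.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for data['text'].values and data['class'].values.
docs = [
    "cheap pills buy now",
    "meeting moved to friday afternoon",
    "buy cheap watches now",
    "lunch on friday with the team",
]
labels = ["spam", "ham", "spam", "ham"]

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),                           # text -> sparse floats
    ("svd", TruncatedSVD(n_components=2, random_state=0)),  # sparse -> dense
    ("scaler", StandardScaler()),                           # now receives numbers only
    ("clf", LogisticRegression()),                          # stand-in for XGBClassifier
])

classifier.fit(docs, labels)
predictions = classifier.predict(docs)
print(predictions)
```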