Tags: python, pandas, scikit-learn, countvectorizer

How to add a second feature to a countvectorized feature using sklearn?


I have 3 columns in my data set:

Review: A product review

Type: A category or product type

Cost: How much the product cost

This is a multiclass problem, with Type as the target variable. There are 64 different Types of products in this dataset.

Review and Cost are my two features.

I've split the data into training and test sets, with the Type variable removed from the features:

from sklearn.model_selection import train_test_split

X = data.drop('type', axis=1)
y = data.type
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

For Review, I am using the following to vectorize it:

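from sklearn.feature_extraction.text import CountVectorizer

# `stop` is my stop-word list, defined earlier
vect = CountVectorizer(stop_words=stop)
X_train_dtm = vect.fit_transform(X_train.review)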

Here's where I am stuck!

To train the model I need both features in the training set. However, X_train_dtm is a sparse matrix, and I am unsure how to concatenate my pandas Series of Cost values to it. Since Cost is already numerical, I don't think it needs transforming, which is why I have not used something like FeatureUnion, which combines the outputs of several transformers.

Any help would be appreciated!!

Example data:

| Review           | Cost | Type        |
|:-----------------|-----:|:-----------:|
| This is a review |  200 | Toy         |
| This is a review |  100 | Toy         |
| This is a review |  800 | Electronics |
| This is a review |   35 | Home        |

Update

After applying tarashypka's solution I was able to add the second feature to X_train_dtm. However, I get an error when I try to do the same for the test set:

from scipy.sparse import hstack

vect = CountVectorizer(stop_words = stop)
X_train_dtm = vect.fit_transform(X_train.review)
prices = X_train.prices.values[:,None]
X_train_dtm = hstack((X_train_dtm, prices))

# Works perfectly for the training set above.
# But when I run it with the test set I get the following error:
X_test_dtm = vect.transform(X_test)
prices_test = X_test.prices.values[:,None]
X_test_dtm = hstack((X_test_dtm, prices_test))

Traceback (most recent call last):

  File "<ipython-input-10-b2861d63b847>", line 8, in <module>
    X_test_dtm = hstack((X_test_dtm, points_test))

  File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)

  File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 581, in bmat
    'row dimensions' % i)

ValueError: blocks[0,:] has incompatible row dimensions

Solution

  • CountVectorizer.fit_transform returns a scipy.sparse.csr_matrix, which in your case is X_train_dtm. If you don't want to convert it to a dense NumPy array, scipy.sparse.hstack is the way to add another column. The [:, None] indexing reshapes the 1-D Cost values into a single column so that the row counts match:

    >>> from scipy.sparse import hstack
    >>> prices = X_train['Cost'].values[:, None]
    >>> X_train_dtm = hstack((X_train_dtm, prices))
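
Regarding the error in the Update: the likely cause appears to be that vect.transform(X_test) receives the whole DataFrame instead of the review column. Iterating over a DataFrame yields its column labels, so the vectorizer would return one row per column, and hstack then sees a row count that differs from the price column. The following is a minimal sketch of the same hstack approach applied to both splits. The toy data and the column names review, cost and type are assumptions for illustration, not the asker's real dataset.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are assumptions for illustration.
data = pd.DataFrame({
    "review": [
        "great toy for kids",
        "fun toy but fragile",
        "sharp screen and fast processor",
        "battery drains quickly",
        "cozy blanket for the sofa",
        "nice lamp for the bedroom",
        "kids love this toy",
        "screen cracked after a week",
    ],
    "cost": [200, 100, 800, 650, 35, 60, 150, 900],
    "type": ["Toy", "Toy", "Electronics", "Electronics",
             "Home", "Home", "Toy", "Electronics"],
})

X = data.drop("type", axis=1)
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

vect = CountVectorizer()

# Learn the vocabulary from the training reviews only.
train_counts = vect.fit_transform(X_train["review"])
# [:, None] turns the 1-D cost values into an (n_rows, 1) column.
train_cost = X_train["cost"].to_numpy()[:, None]
X_train_dtm = hstack((train_counts, train_cost), format="csr")

# Transform the review column (not the whole DataFrame) with the fitted vocabulary.
test_counts = vect.transform(X_test["review"])
test_cost = X_test["cost"].to_numpy()[:, None]
X_test_dtm = hstack((test_counts, test_cost), format="csr")

print(X_train_dtm.shape, X_test_dtm.shape)
```

Both matrices end up with one row per sample and the same number of columns (the vocabulary size plus one for the cost). Passing format="csr" is optional; it is used here because hstack otherwise returns COO format, which does not support row slicing.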