Search code examples
pandasscikit-learnpipelineword2vec

Custom word2vec Transformer on pandas dataframe and using it in FeatureUnion


For the below pandas DataFrame df, I want to transform the type column to OneHotEncoding, and transform the word column to its vector representation using the dictionary word2vec. Then I want to concatenate the two transformed vectors with the count column to form the final feature for classification.

>>> df
       word type  count
0     apple    A      4
1       cat    B      3
2  mountain    C      1 

>>> df.dtypes
word       object
type     category
count       int64

>>> word2vec
{'apple': [0.1, -0.2, 0.3], 'cat': [0.2, 0.2, 0.3], 'mountain': [0.4, -0.2, 0.3]}

I defined customized Transformer, and use FeatureUnion to concatenate the features.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder

class w2vTransformer(TransformerMixin):

    def __init__(self,word2vec):
        self.word2vec = word2vec

    def fit(self,x, y=None):
        return self

    def wv(self, w):
        return self.word2vec[w] if w in self.word2vec else [0, 0, 0]

    def transform(self, X, y=None):
         return df['word'].apply(self.wv)

pipeline = Pipeline([
    ('features', FeatureUnion(transformer_list=[
        # Part 1: get integer column
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
        ])),

        # Part 2: get category column and its onehotencoding
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])), 

        # Part 3: transform word to its embedding
        ('word2vec', Pipeline([
            ('w2v', w2vTransformer(word2vec)),
        ]))
    ])),
])

When I run pipeline.fit_transform(df), I got the error: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 1, expected 3.

However, if I removed the word2vec Transformer (Part 3) from the pipeline, the pipeline (Part1 1 + Part 2) works fine.

>>> pipeline_no_word2vec.fit_transform(df).todense()
matrix([[4., 1., 0., 0.],
        [3., 0., 1., 0.],
        [1., 0., 0., 1.]])

And if I keep only the w2v transformer in the pipeline, it also works.

>>> pipeline_only_word2vec.fit_transform(df)
array([list([0.1, -0.2, 0.3]), list([0.2, 0.2, 0.3]),
       list([0.4, -0.2, 0.3])], dtype=object)

My guess is that there is something wrong in my w2vTransformer class but don't know how to fix it. Please help.


Solution

  • This error is due to the fact that the FeatureUnion expects a 2-d array from each of its parts.

    Now the first two parts of your FeatureUnion:- 'numericals' and 'categoricals' are correctly sending 2-d data of shape (n_samples, n_features).

    n_samples = 3 in your example data. n_features will depend on individual parts (like OneHotEncoder will change them in 2nd part, but will be 1 in first part).

    But the third part 'word2vec' returns a pandas.Series object which have the 1-d shape (3,). FeatureUnion takes this a shape (1, 3) by default and hence the complains that it does not match other blocks.

    So you need to correct that shape.

    Now even if you simply do a reshape() at the end and change it to shape (3,1), your code will not run, because the internal contents of that array are lists from your word2vec dict, which are not transformed correctly to a 2-d array. Instead it will become a array of lists.

    Change the w2vTransformer to correct the error:

    class w2vTransformer(TransformerMixin):
        ...
        ...
        def transform(self, X, y=None):
            return np.array([np.array(vv) for vv in X['word'].apply(self.wv)])
    

    And after that the pipeline will work.