
How to use t-SNE inside a scikit-learn pipeline


How can I use t-SNE inside my pipeline? Without a pipeline I have managed to run t-SNE and, on its output, a classification algorithm. Do I need to write a custom class that can be called in the pipeline and returns a dataframe, or how does this work?

# How I used t-SNE
%%time

import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_std = StandardScaler().fit_transform(dfListingsFeature_classification)
ts = TSNE()
X_tsne = ts.fit_transform(X_std)

print(X_tsne.shape)
feature_list = []
for i in range(1, X_tsne.shape[1] + 1):
    feature_list.append("TSNE" + str(i))

df_new = pd.DataFrame(X_tsne, columns=feature_list)

df_new['label'] = y
#df_new.head()

X = df_new.drop(columns=['label'])
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rfc = RandomForestClassifier()

# Train the random forest classifier
rfc = rfc.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = rfc.predict(X_test)

How I want to use it:

# How could I use TSNE() inside the pipeline?
%%time
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

steps = [('standardscaler', StandardScaler()),
         ('tsne', TSNE()),
         ('rfc', RandomForestClassifier())]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)


parameters = {'rfc__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'rfc__criterion': ['gini', 'entropy']}

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)

grid.fit(X_train, y_train)

print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))
print(grid.best_params_)

y_pred = grid.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
[OUT] TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'TSNE()' (type <class 'sklearn.manifold._t_sne.TSNE'>) doesn't

Should I build a custom transformer, and if so, what should it look like?

class TestTSNE(BaseEstimator, TransformerMixin):
  def __init__(self):
    # don't know
    pass

  def fit(self, X, y=None):
    X_std = StandardScaler().fit_transform(dfListingsFeature_classification)
    ts = TSNE()
    self.X_tsne = ts.fit_transform(X_std)
    return self

  def transform(self, X, y=None):
    feature_list = []
    for i in range(1, self.X_tsne.shape[1] + 1):
        feature_list.append("TSNE" + str(i))

    df_new = pd.DataFrame(self.X_tsne, columns=feature_list)

    df_new['label'] = y
    #df_new.head()

    X = df_new.drop(columns=['label'])
    y = df_new['label']
    return X, y
...
steps = [('standardscaler', StandardScaler()),
         ('testTSNE', TestTSNE()),
         ('rfc', RandomForestClassifier())]

pipeline = Pipeline(steps) 

Solution

  • I think you misunderstood the use of Pipeline. From the help page:

    Pipeline of transforms with a final estimator.

    Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit

    So this means if your pipeline is:

    steps = [('standardscaler', StandardScaler()),
             ('tsne', TSNE()),
             ('rfc', RandomForestClassifier())]
    

    You are going to apply StandardScaler to your features first, then transform the result of this with t-SNE, before passing it to the classifier. That cannot work here, because TSNE has no transform method for unseen data, and I don't think it makes much sense to train on the t-SNE output anyway.
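
    A quick check illustrates this (a minimal sketch; scikit-learn's TSNE only exposes fit and fit_transform):

    from sklearn.manifold import TSNE

    # TSNE has no transform method for new samples, which is exactly what
    # the Pipeline error message above is complaining about.
    print(hasattr(TSNE(), "transform"))      # False
    print(hasattr(TSNE(), "fit_transform"))  # True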

    If you really want to latch onto the pipeline, you will need to store the t-SNE result as an attribute during fit, then have transform return the features unchanged, so that the classifier can work on them.

    Something like

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.manifold import TSNE
    from sklearn.datasets import make_classification

    class TestTSNE(BaseEstimator, TransformerMixin):
        def __init__(self, n_components, random_state=None, method='exact'):
            self.n_components = n_components
            self.method = method
            self.random_state = random_state

        def fit(self, X, y=None):
            # store the embedding of the training data as an attribute
            ts = TSNE(n_components=self.n_components,
                      method=self.method, random_state=self.random_state)
            self.X_tsne = ts.fit_transform(X)
            return self

        def transform(self, X, y=None):
            # pass the features through unchanged for the next step
            return X
    

    Then:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    steps = [('standardscaler', StandardScaler()),
             ('testTSNE', TestTSNE(2)),
             ('rfc', RandomForestClassifier())]

    pipeline = Pipeline(steps)
    X, y = make_classification()
    pipeline.fit(X, y)
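
    For completeness, the fitted pipeline can then be used like any other estimator; predictions run on the original (scaled) features, since transform passes X through (a small usage sketch, assuming the code above has been run):

    # TestTSNE.transform returns X unchanged, so the random forest
    # predicts from the scaled features, not from the embedding.
    print(pipeline.predict(X[:5]))
    print(pipeline.score(X, y))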
    

    You can retrieve your t-SNE embedding like this:

    pd.DataFrame(pipeline.steps[1][1].X_tsne)
    
    
                0          1
    0  -38.756626  -4.693253
    1   46.516308  53.633842
    2   49.107910  16.482645
    3   18.306377   9.432504
    4   33.551056 -27.441383
    ..        ...        ...
    95 -31.337574 -16.913471
    96 -57.918224 -39.959976
    97  55.282658  37.582535
    98  66.425125  19.717241
    99 -50.692646  11.545088
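
    Equivalently, the step can be looked up by name, and a quick scatter plot is a handy sanity check (matplotlib is assumed to be available; just an illustrative sketch):

    import matplotlib.pyplot as plt

    # Same embedding, retrieved via named_steps instead of positional indexing
    emb = pipeline.named_steps['testTSNE'].X_tsne
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=15)
    plt.title('t-SNE embedding of the training data')
    plt.show()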