
How to use t-SNE inside a scikit-learn pipeline


How can I use t-SNE inside my pipeline? Without a pipeline I have managed to run t-SNE and, on its output, a classification algorithm. Do I need to write a custom class that can be called in the pipeline and returns a dataframe, or how does this work?

# How I used t-SNE
%%time

import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_std = StandardScaler().fit_transform(dfListingsFeature_classification)
ts = TSNE()
X_tsne = ts.fit_transform(X_std)

print(X_tsne.shape)
feature_list = []
for i in range(1, X_tsne.shape[1] + 1):
    feature_list.append("TSNE" + str(i))

df_new = pd.DataFrame(X_tsne, columns=feature_list)

df_new['label'] = y
#df_new.head()

X = df_new.drop(columns=['label'])
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rfc = RandomForestClassifier()

# Train the random forest classifier
rfc = rfc.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = rfc.predict(X_test)

How I want to use it:

# How could I use TSNE() inside the pipeline?
%%time
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

steps = [('standardscaler', StandardScaler()),
         ('tsne', TSNE()),
         ('rfc', RandomForestClassifier())]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)


parameters = {'rfc__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'rfc__criterion': ['gini', 'entropy']}

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)

grid.fit(X_train, y_train)

print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))
print(grid.best_params_)

y_pred = grid.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
[OUT] TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'TSNE()' (type <class 'sklearn.manifold._t_sne.TSNE'>) doesn't

Should I build a custom transformer, and if so, what should it look like?

class TestTSNE(BaseEstimator, TransformerMixin):
  def __init__(self):
    # don't know
    pass

  def fit(self, X, y=None):
    X_std = StandardScaler().fit_transform(dfListingsFeature_classification)
    ts = TSNE()
    self.X_tsne = ts.fit_transform(X_std)
    return self

  def transform(self, X, y=None):
    feature_list = []
    for i in range(1, self.X_tsne.shape[1] + 1):
        feature_list.append("TSNE" + str(i))

    df_new = pd.DataFrame(self.X_tsne, columns=feature_list)

    df_new['label'] = y
    #df_new.head()

    X = df_new.drop(columns=['label'])
    y = df_new['label']
    return X, y
...
steps = [('standardscaler', StandardScaler()),
         ('testTSNE', TestTSNE()),
         ('rfc', RandomForestClassifier())]

pipeline = Pipeline(steps) 

Solution

  • I think you misunderstood the use of Pipeline. From the help page:

    Pipeline of transforms with a final estimator.

    Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit

    So this means if your pipeline is:

    steps = [('standardscaler', StandardScaler()),
             ('tsne', TSNE()),
             ('rfc', RandomForestClassifier())]
    

    You are going to apply StandardScaler to your features first, then transform the result of this with t-SNE, before passing it to the classifier. That cannot work here, because TSNE has no transform method for unseen data, and I don't think it makes much sense to train on the t-SNE output anyway.
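
    A quick check illustrates this (a minimal sketch; scikit-learn's TSNE only exposes fit and fit_transform):

    from sklearn.manifold import TSNE

    # TSNE has no transform method for new samples, which is exactly what
    # the Pipeline error message above is complaining about.
    print(hasattr(TSNE(), "transform"))      # False
    print(hasattr(TSNE(), "fit_transform"))  # True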

    If you really want to latch onto the pipeline, you will need to store the t-SNE result as an attribute during fit, then have transform return the features unchanged, so that the classifier can work on them.

    Something like

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.manifold import TSNE
    from sklearn.datasets import make_classification

    class TestTSNE(BaseEstimator, TransformerMixin):
        def __init__(self, n_components, random_state=None, method='exact'):
            self.n_components = n_components
            self.method = method
            self.random_state = random_state

        def fit(self, X, y=None):
            # store the embedding of the training data as an attribute
            ts = TSNE(n_components=self.n_components,
                      method=self.method, random_state=self.random_state)
            self.X_tsne = ts.fit_transform(X)
            return self

        def transform(self, X, y=None):
            # pass the features through unchanged for the next step
            return X
    

    Then:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    steps = [('standardscaler', StandardScaler()),
             ('testTSNE', TestTSNE(2)),
             ('rfc', RandomForestClassifier())]

    pipeline = Pipeline(steps)
    X, y = make_classification()
    pipeline.fit(X, y)
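
    For completeness, the fitted pipeline can then be used like any other estimator; predictions run on the original (scaled) features, since transform passes X through (a small usage sketch, assuming the code above has been run):

    # TestTSNE.transform returns X unchanged, so the random forest
    # predicts from the scaled features, not from the embedding.
    print(pipeline.predict(X[:5]))
    print(pipeline.score(X, y))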
    

    You can retrieve your t-SNE embedding like this:

    pd.DataFrame(pipeline.steps[1][1].X_tsne)
    
    
                0          1
    0  -38.756626  -4.693253
    1   46.516308  53.633842
    2   49.107910  16.482645
    3   18.306377   9.432504
    4   33.551056 -27.441383
    ..        ...        ...
    95 -31.337574 -16.913471
    96 -57.918224 -39.959976
    97  55.282658  37.582535
    98  66.425125  19.717241
    99 -50.692646  11.545088
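
    Equivalently, the step can be looked up by name, and a quick scatter plot is a handy sanity check (matplotlib is assumed to be available; just an illustrative sketch):

    import matplotlib.pyplot as plt

    # Same embedding, retrieved via named_steps instead of positional indexing
    emb = pipeline.named_steps['testTSNE'].X_tsne
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=15)
    plt.title('t-SNE embedding of the training data')
    plt.show()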