machine-learning scikit-learn pipeline cross-validation feature-selection

Put customized functions in Sklearn pipeline


In my classification scheme, there are several steps including:

  1. SMOTE (Synthetic Minority Over-sampling Technique)
  2. Fisher criteria for feature selection
  3. Standardization (Z-score normalisation)
  4. SVC (Support Vector Classifier)

The main parameters to be tuned in the scheme above are the percentile (step 2) and the hyperparameters of the SVC (step 4), and I want to tune them with a grid search.

The current solution builds a "partial" pipeline containing only steps 3 and 4 of the scheme, clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))]), and breaks the scheme into two parts:

  1. Tune the percentile of features to keep through the first grid search

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for percentile in percentiles:
            # Fisher returns the indices of the selected features specified by the parameter 'percentile'
            selected_ind = Fisher(X_train, y_train, percentile) 
            # Features are columns, so index the second axis
            X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            # f1_score expects (y_true, y_pred)
            f1 = f1_score(y_test, y_predict)
    

    The F1 scores are stored and averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The percentile loop is the inner loop so that the comparison is fair: within each fold, every percentile is evaluated on the same training data (including the synthesized data).

  2. After determining the percentile, tune the hyperparameters with a second grid search

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for parameters in parameter_comb:
            # Select the features based on the tuned percentile
            selected_ind = Fisher(X_train, y_train, best_percentile) 
            # Features are columns, so index the second axis
            X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
            clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            # f1_score expects (y_true, y_pred)
            f1 = f1_score(y_test, y_predict)
    

This is done in a very similar way, except that we tune the hyperparameters of the SVC rather than the percentile of features to select.

My questions are:

  1. In the current solution, only steps 3 and 4 are part of clf, and I do steps 1 and 2 "manually" in the two nested loops described above. Is there any way to include all four steps in a pipeline and do the whole process at once?

  2. If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline

    clf_all = Pipeline([('smote', SMOTE()),
                        ('fisher', Fisher(percentile=best_percentile)),
                        ('normal',preprocessing.StandardScaler()),
                        ('svc',svm.SVC(class_weight='auto'))]) 
    

    and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

    Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.

Any comments would be much appreciated.

SMOTE and Fisher are shown below:

    from numpy import argsort, ceil, shape

    def Fscore(X, y, percentile=None):
        # Split the samples into the positive (1) and negative (0) class
        X_pos, X_neg = X[y==1], X[y==0]
        X_mean = X.mean(axis=0)
        X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
        deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
        num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
        F = num/deno
        # Rank the features by descending score and keep the top `percentile` percent
        sort_F = argsort(F)[::-1]
        n_feature = (float(percentile)/100)*shape(X)[1]
        ind_feature = sort_F[:int(ceil(n_feature))]
        return ind_feature
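Read off the function above, the score it assigns to feature j can be written as follows. This is only a transcription of the posted code, not a reference definition; the symbols are introduced here for illustration, with n_+ and n_- the class sizes and \sigma^2 the variance as NumPy's var computes it by default (normalized by n).

```latex
F_j = \frac{\left(\bar{x}^{+}_{j} - \bar{x}_{j}\right)^{2} + \left(\bar{x}^{-}_{j} - \bar{x}_{j}\right)^{2}}
           {\dfrac{\sigma^{2}_{+,j}}{n_{+} - 1} + \dfrac{\sigma^{2}_{-,j}}{n_{-} - 1}}
```

The features are then sorted by descending F_j and the first ceil(percentile/100 * n_features) indices are kept.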

SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns only the synthesized data. I wrapped it so that it returns the original input data stacked with the synthesized data, together with the original and synthesized labels.

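    import numpy as np

    def smote(X, y):
        # Count the positive (minority) and negative (majority) samples
        n_pos, n_neg = sum(y==1), sum(y==0)
        n_syn = (n_neg-n_pos)/float(n_pos)
        X_pos = X[y==1]
        # SMOTE is the function from the smote.py linked above
        X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
        y_syn = np.ones(np.shape(X_syn)[0])
        # Stack the original and synthesized samples and their labels
        X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
        return X, y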

Solution

  • I don't know where your SMOTE() and Fisher() functions come from, but the answer is yes, you can definitely do this. To do so, you will need to write a wrapper class around those functions. The easiest way to do this is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

    If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.

    EDIT:

    I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations of the target, so you will have to do them beforehand, as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.

    >>> from sklearn.base import BaseEstimator, TransformerMixin
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.svm import SVC
    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.grid_search import GridSearchCV
    >>> from sklearn.datasets import load_iris
    >>> 
    >>> class Fisher(BaseEstimator, TransformerMixin):
    ...     def __init__(self,percentile=0.95):
    ...             self.percentile = percentile
    ...     def fit(self, X, y):
    ...             from numpy import shape, argsort, ceil
    ...             X_pos, X_neg = X[y==1], X[y==0]
    ...             X_mean = X.mean(axis=0)
    ...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    ...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    ...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    ...             F = num/deno
    ...             sort_F = argsort(F)[::-1]
    ...             n_feature = (float(self.percentile)/100)*shape(X)[1]
    ...             self.ind_feature = sort_F[:ceil(n_feature)]
    ...             return self
    ...     def transform(self, x):
    ...             return x[self.ind_feature,:]
    ... 
    >>> 
    >>> data = load_iris()
    >>> 
    >>> pipeline = Pipeline([
    ...     ('fisher', Fisher()),
    ...     ('normal',StandardScaler()),
    ...     ('svm',SVC(class_weight='auto'))
    ... ])
    >>> 
    >>> grid = {
    ...     'fisher__percentile':[0.75,0.50],
    ...     'svm__C':[1,2]
    ... }
    >>> 
    >>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
    >>> model.fit(data.data,data.target)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
        return self._fit(X, y, ParameterGrid(self.param_grid))
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
        for parameters in parameter_iterable
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
        self.dispatch(function, args, kwargs)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
        self.results = func(*args, **kwargs)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
        self.steps[-1][-1].fit(Xt, y, **fit_params)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
        (X.shape[0], y.shape[0]))
    ValueError: X and y have incompatible shapes.
    X has 1 samples, but y has 75.
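
The traceback above appears to come from two details of the Fisher class rather than from the pipeline idea itself: transform indexes rows (x[self.ind_feature,:]) instead of columns, and grid values such as 0.75 are divided by 100 again in fit, so only a single index is kept. The following is a minimal sketch of a corrected version, not a definitive implementation. It assumes a recent scikit-learn (sklearn.model_selection instead of sklearn.grid_search) and binary labels 0/1. FisherSelector, oversample_minority, the make_classification data and the grid values are all hypothetical names and choices made for illustration; oversample_minority simply duplicates minority rows as a stand-in for the SMOTE wrapper from the question. Because a scikit-learn Pipeline cannot change y, the oversampling stays in a manual fold loop, but one loop now covers both the percentile and the SVC hyperparameters.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class FisherSelector(BaseEstimator, TransformerMixin):
    """Keep the top `percentile` percent of features, ranked as in Fscore()."""

    def __init__(self, percentile=50):
        self.percentile = percentile  # in percent: 50 keeps half of the features

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        X_pos, X_neg = X[y == 1], X[y == 0]
        X_mean = X.mean(axis=0)
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        deno = X_pos.var(axis=0) / (len(X_pos) - 1) + X_neg.var(axis=0) / (len(X_neg) - 1)
        n_keep = int(np.ceil(self.percentile / 100.0 * X.shape[1]))
        self.ind_feature_ = np.argsort(num / deno)[::-1][:n_keep]
        return self

    def transform(self, X):
        # Select columns (features), not rows (samples)
        return np.asarray(X)[:, self.ind_feature_]


def oversample_minority(X, y, rng):
    """Hypothetical stand-in for SMOTE: duplicate random minority (y == 1) rows."""
    n_extra = np.sum(y == 0) - np.sum(y == 1)
    if n_extra <= 0:
        return X, y
    extra = rng.choice(np.flatnonzero(y == 1), size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


# Synthetic imbalanced data, for illustration only
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

pipe = Pipeline([("fisher", FisherSelector()),
                 ("normal", StandardScaler()),
                 ("svc", SVC())])
grid = list(ParameterGrid({"fisher__percentile": [25, 50, 100],
                           "svc__C": [1, 10],
                           "svc__gamma": [0.01, 0.1]}))

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = {i: [] for i in range(len(grid))}
for train_ind, test_ind in skf.split(X, y):
    # Oversample the training fold only; the test fold stays intact
    X_train, y_train = oversample_minority(X[train_ind], y[train_ind], rng)
    for i, params in enumerate(grid):
        # Every parameter combination sees the same resampled training fold
        model = clone(pipe).set_params(**params).fit(X_train, y_train)
        scores[i].append(f1_score(y[test_ind], model.predict(X[test_ind])))

mean_scores = {i: float(np.mean(s)) for i, s in scores.items()}
best = max(mean_scores, key=mean_scores.get)
best_params = grid[best]
print(best_params, mean_scores[best])
```

Feature selection and scaling are fitted inside the pipeline, so they only ever see the (resampled) training fold. If the imbalanced-learn package is an option, its imblearn.pipeline.Pipeline accepts samplers such as imblearn.over_sampling.SMOTE as steps and applies them only while fitting, which would let GridSearchCV drive all four steps; that route is not shown here. Note also that class_weight='auto' was removed from later scikit-learn releases in favour of class_weight='balanced'.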