
How to test unseen test data with cross validation and predict labels?


1. The CSV that contains the data (i.e. text descriptions) along with categorized labels:

import pandas as pd

df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']

2. This CSV contains unseen data (i.e. text descriptions) for which labels need to be predicted:

df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']

Cross-validation function that operates on the training data (item #1) above:

from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def cross_val():
    cv = KFold(n_splits=20)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(X)
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    scores = cross_val_score(clf, X_train, y, cv=cv)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

cross_val()

I need to know how to pass the unseen data (item #2) to the cross-validation function and how to predict its labels.


Solution

  • With scores = cross_val_score(clf, X_train, y, cv=cv) you only get the cross-validated scores of the model. cross_val_score internally splits the data into training and test folds based on the cv parameter.

    So the values you get are the cross-validated accuracies of the SVC.
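
    For illustration, here is a minimal sketch (using made-up toy data rather than your CSVs, purely as an assumption for the example) showing that cross_val_score only returns an array of per-fold scores and does not leave a fitted model behind:

    from sklearn import svm
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # toy data standing in for your TF-IDF matrix and labels
    X_toy, y_toy = make_classification(n_samples=100, n_features=20, random_state=0)

    clf_toy = svm.SVC(C=1)
    scores = cross_val_score(clf_toy, X_toy, y_toy, cv=5)
    print(scores)         # one accuracy per fold
    print(scores.mean())  # the averaged cross-validated accuracy

    # clf_toy itself was never fitted on the full data; cross_val_score fits
    # clones internally, so you still need clf_toy.fit(...) before predict().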

    To get the score on the unseen data, you can first fit the model e.g.

    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    clf.fit(X_train, y) # the model is trained now
    

    and then call clf.score(X_unseen, y_unseen), where y_unseen are the true labels of the unseen data.

    The last will return the accuracy of the model on the unseen data. If the unseen data has no labels (as in your item #2), use clf.predict(X_unseen) instead to get the predicted labels.
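
    Applied to your own data, a minimal sketch could look like the following (assuming X, y and X2 are the pandas Series from your two CSVs above; note that the unseen descriptions must be transformed with the same fitted vectorizer, not re-fitted):

    from sklearn import preprocessing, svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    # fit the vectorizer on the labelled descriptions only
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    X_train = vectorizer.fit_transform(X)

    # train the same pipeline as in your cross_val() function
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    clf.fit(X_train, y)

    # transform (do NOT fit) the unseen descriptions and predict their labels
    X_unseen = vectorizer.transform(X2)
    predicted_labels = clf.predict(X_unseen)
    print(predicted_labels)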


    EDIT: The best way to do what you want is the following: use a GridSearch to first find the best model on the training data, and then evaluate that best model on the unseen (test) data:

    from sklearn import svm, datasets
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    
    # load some data
    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    
    # split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    
    # hyperparameter tuning of the SVC model
    parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
    svc = svm.SVC()
    
    # fit the GridSearch using the TRAINING data
    grid_searcher = GridSearchCV(svc, parameters)
    grid_searcher.fit(X_train, y_train)
    
    # recover the best estimator (best parameters for the SVC, based on the GridSearch)
    best_SVC_model = grid_searcher.best_estimator_
    
    # Now, check how this best model behaves on the test set
    cv_scores_on_unseen = cross_val_score(best_SVC_model, X_test, y_test, cv=5)
    print(cv_scores_on_unseen.mean())
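
    To carry this over to your text data, the same idea would be to put the TfidfVectorizer and the SVC into one Pipeline and grid-search that, then predict on the unseen descriptions. A hedged sketch, assuming the X, y and X2 variables from your question:

    from sklearn import svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # one pipeline, so the TF-IDF step is re-fitted inside each CV split
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('svc', svm.SVC()),
    ])

    # grid over the SVC hyperparameters (the 'svc__' prefix refers to the pipeline step)
    param_grid = {'svc__kernel': ('linear', 'rbf'), 'svc__C': [1, 10]}

    grid_searcher = GridSearchCV(pipe, param_grid, cv=5)
    grid_searcher.fit(X, y)            # X, y: labelled descriptions from item #1
    print(grid_searcher.best_params_)

    # predict labels for the unseen descriptions from item #2
    predicted_labels = grid_searcher.predict(X2)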