python · scikit-learn · cross-validation

Cross validation with pre-split train and test datasets


Unlike the standard setup, my dataset comes pre-split into separate train, test1 and test2 sets. I implemented ML algorithms and got performance metrics, but when I try to apply cross validation it gets complicated. Maybe someone can help me. Thank you.

Here is my code:

import pandas as pd

train = pd.read_csv('train-alldata.csv', sep=';')
test = pd.read_csv('test1-alldata.csv', sep=';')
test2 = pd.read_csv('test2-alldata.csv', sep=';')

# Split each file into features and the churn label
X_train = train.drop('churn_yn', axis=1)
y_train = train['churn_yn']

X_test = test.drop('churn_yn', axis=1)
y_test = test['churn_yn']

X_test_2 = test2.drop('churn_yn', axis=1)
y_test_2 = test2['churn_yn']

For example, a KNN classifier:

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=7, metric='euclidean')
knn_classifier.fit(X_train, y_train)

For K-Fold cross validation:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

dtc = DecisionTreeClassifier(random_state=42)

k_folds = KFold(n_splits=5)

# Which X and y should be passed here, given the three separate files?
scores = cross_val_score(dtc, X, y, cv=k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Solution

  • This is a variation on the "holdout test data" pattern (see also Wikipedia: Training, validation, and test data sets, section "Confusion in terminology"). For churn prediction, this may arise if you have two types of customers, or are evaluating over two time frames.

    X_train, y_train    ← perform training and hyperparameter tuning with this
    X_test1, y_test1    ← test on this
    X_test2, y_test2    ← test on this as well
    
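    Applied to the variables from the question, the pattern looks like this (a minimal sketch; it assumes the train/test frames loaded above):

    from sklearn.neighbors import KNeighborsClassifier
    
    # Train (and, if desired, tune) on the training data only
    knn = KNeighborsClassifier(n_neighbors=7, metric='euclidean')
    knn.fit(X_train, y_train)
    
    # Evaluate once on each pre-split test set (accuracy here)
    print(knn.score(X_test, y_test))      # test1
    print(knn.score(X_test_2, y_test_2))  # test2
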

    Cross validation estimates holdout error using only the training data; it typically comes up when tuning hyperparameters, for example with GridSearchCV (a sketch follows the code below). Final evaluation means measuring performance on the two test sets, either separately or averaged over the two:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import f1_score
    
    # Synthetic stand-in for the question's three files: one train set, two test sets
    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
    X_test1, X_test2, y_test1, y_test2 = train_test_split(X_test, y_test, test_size=.5)
    
    print(y_train.shape, y_test1.shape, y_test2.shape)
    # (600,) (200,) (200,)
    
    # Fit on the training portion only
    clf = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
    
    # Report performance separately on each held-out test set
    print(f1_score(y_test1, clf.predict(X_test1)))
    print(f1_score(y_test2, clf.predict(X_test2)))
    # 0.819
    # 0.805
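
    To answer the cross validation part of the question directly: run CV on the training data only, for example to tune n_neighbors. Below is a minimal sketch with GridSearchCV, continuing from the variables above; the parameter grid and scoring='f1' are illustrative assumptions, not part of the original post:

    from sklearn.model_selection import GridSearchCV, KFold
    
    # 5-fold CV over the training data only; grid values are illustrative
    param_grid = {'n_neighbors': [3, 5, 7, 9]}
    search = GridSearchCV(KNeighborsClassifier(), param_grid,
                          cv=KFold(n_splits=5), scoring='f1')
    search.fit(X_train, y_train)
    
    print(search.best_params_)  # chosen on the training folds alone
    
    # Final evaluation: the refit best model, once per held-out test set
    # (or report the mean of the two scores)
    print(f1_score(y_test1, search.predict(X_test1)))
    print(f1_score(y_test2, search.predict(X_test2)))

    This way test1 and test2 stay untouched during tuning, so the two final scores remain honest estimates of generalization.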