Tags: python, pandas, dataframe, scikit-learn, cross-validation

KeyError when implementing cross-validation with GroupKFold


I have a DataFrame with three main columns: 'label', 'embeddings' (features), and 'chr'. I am trying to do 10-fold cross-validation grouped by chromosome, so that all rows for a given chromosome (e.g. chr1) end up entirely in either the train set or the test set, never split across both. The DataFrame looks like this: [image: DataFrame preview]

I believe my code is correct, but I keep running into this KeyError: [image: KeyError traceback]

Here's my code:

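import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GroupKFold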

X = np.array([np.array(x) for x in mini_df['embeddings']])
y = mini_df['label']
groups = mini_df['chromosome']
group_kfold = GroupKFold(n_splits=10)

# Initialize figure for plotting
plt.figure(figsize=(10, 6))

# Perform cross-validation and plot ROC curves for each fold
for i, (train_idx, val_idx) in enumerate(group_kfold.split(X, y, groups)):
    X_train_fold, X_val_fold = X[train_idx], X[val_idx]
    y_train_fold, y_val_fold = y[train_idx], y[val_idx]
    
    # Initialize classifier
    rf_classifier = RandomForestClassifier(n_estimators=n_trees, random_state=42, max_depth=max_depth, n_jobs=-1)
    
    # Train the classifier on this fold
    rf_classifier.fit(X_train_fold, y_train_fold)
    
    # Make predictions on the validation set
    y_pred_proba = rf_classifier.predict_proba(X_val_fold)[:, 1]
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_val_fold, y_pred_proba)
    
    # Calculate AUC
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve for this fold
    plt.plot(fpr, tpr, lw=1, alpha=0.7, label=f'ROC Fold {i+1} (AUC = {roc_auc:.2f})')

# Plot ROC for random classifier
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random', alpha=0.8)

# Add labels and legend
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for Random Forest Classifier')
plt.legend(loc='lower right')
plt.show()

Solution

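  • The error is raised when indexing y, not X, so the X[train_idx] and X[val_idx] operations execute successfully.

    X is a NumPy array, but y = mini_df['label'] is a pandas Series. GroupKFold.split returns positional indices, whereas y[train_idx] on a Series with an integer index looks those values up as index labels. If the index of mini_df is not the default 0..n-1 (for example because mini_df is a filtered subset of a larger DataFrame), some of those labels probably do not exist and pandas raises a KeyError. One fix is to convert the Series to a NumPy array (https://pandas.pydata.org/pandas-docs/version/0.24.0rc1/api/generated/pandas.Series.to_numpy.html):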

    y = mini_df['label'].to_numpy()
    

    Alternatively, if you want to keep y as a pandas object, select the rows by position with .iloc:

    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
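
    The explanation above can be reproduced with a small sketch. It assumes the KeyError comes from mini_df having a non-default integer index; the toy DataFrame, its index values, and the dummy features below are hypothetical stand-ins, not the asker's data. It shows the failing lookup, both fixes, and a quick check that no chromosome is shared between the train and validation folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy data standing in for mini_df: 3 chromosomes, 2 rows each.
df = pd.DataFrame({
    "label": [0, 1, 0, 1, 0, 1],
    "chromosome": ["chr1", "chr1", "chr2", "chr2", "chr3", "chr3"],
})
# Simulate a filtered subset: index labels no longer run 0..n-1 (assumption).
df.index = [10, 11, 20, 21, 30, 31]

X = np.arange(len(df) * 2).reshape(len(df), 2)  # dummy feature matrix
y = df["label"]
groups = df["chromosome"]

# Take the first fold only; split() yields positional index arrays.
splitter = GroupKFold(n_splits=3)
train_idx, val_idx = next(splitter.split(X, y, groups))

# y[train_idx] treats the positions as index labels, which do not exist here.
try:
    y[train_idx]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Fix 1: positional selection on the Series.
y_train_iloc = y.iloc[train_idx]

# Fix 2: convert to a NumPy array, which is always indexed by position.
y_train_np = y.to_numpy()[train_idx]

# Sanity check: the chromosomes in the two folds should not overlap.
train_groups = set(groups.iloc[train_idx])
val_groups = set(groups.iloc[val_idx])
print(label_lookup_failed, train_groups.isdisjoint(val_groups))
```

    Both fixes select the same rows. Converting to NumPy keeps the loop body unchanged, while .iloc preserves the original index labels, which can help when tracing predictions back to rows of mini_df.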