Tags: python, numpy, machine-learning, cross-validation, k-fold

How to create training sets for k-fold cross-validation without scikit-learn?


I have a data set with 95 rows and 9 columns and want to do 5-fold cross-validation. In training, the first 8 columns (features) are used to predict the ninth column. My test sets are correct, but my x training set has shape (4,19,9) when it should have only 8 columns, and my y training set has shape (4,9) when it should have 19 rows. Am I indexing the subarrays incorrectly?

kdata = data[0:95,:] # Need total rows to be divisible by 5, so ignore last 2 rows 
np.random.shuffle(kdata) # Shuffle all rows
folds = np.array_split(kdata, k) # each fold is 19 rows x 9 columns

for i in range (k-1):
    xtest = folds[i][:,0:7] # Set ith fold to be test
    ytest = folds[i][:,8]
    new_folds = np.delete(folds,i,0)
    xtrain = new_folds[:][:][0:7] # training set is all folds, all rows x 8 cols
    ytrain = new_folds[:][:][8]   # training y is all folds, all rows x 1 col

Solution

  • Welcome to Stack Overflow.

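    After removing the test fold, you need to stack the remaining folds row-wise with np.row_stack() (an alias of np.vstack(); recent NumPy releases deprecate the alias, so np.vstack() is the safer spelling). Without that step, new_folds is still a 3-D array of shape (4,19,9), and chained slices such as new_folds[:][:][0:7] all act on the first axis, so they never select columns. Index rows and columns together instead, as in new_folds[:, :8].

    You are also slicing the columns incorrectly. In Python and NumPy, slices are [inclusive:exclusive], so [0:7] takes only 7 columns (indices 0 to 6) instead of the 8 feature columns you intended. Use [:, :8].

    Similarly, for 5 folds the loop should use range(k), which gives [0,1,2,3,4], instead of range(k-1), which gives only [0,1,2,3] and never uses the last fold as a test set.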

    Modified code as such:

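    import numpy as np

    k = 5
    # kdata is the 95-row array from the question (kdata = data[0:95, :])
    np.random.shuffle(kdata) # Shuffle all rows in place before splitting
    folds = np.array_split(kdata, k) # each fold is 19 rows x 9 columns
    
    for i in range(k):
        xtest = folds[i][:, :8] # ith fold is the test set; columns 0-7 are features
        ytest = folds[i][:, 8]  # column 8 is the target
        # Drop the test fold, then stack the remaining 4 folds into one (76, 9) array.
        # np.vstack is the function that np.row_stack aliases.
        new_folds = np.vstack(np.delete(folds, i, 0))
        xtrain = new_folds[:, :8]
        ytrain = new_folds[:, 8]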
    
        # some print functions to help you debug
        print(f'Fold {i}')
        print(f'xtest shape  : {xtest.shape}')
        print(f'ytest shape  : {ytest.shape}')
        print(f'xtrain shape : {xtrain.shape}')
        print(f'ytrain shape : {ytrain.shape}\n')
    

    which will print out the fold and the desired shape of training and testing sets for you:

    Fold 0
    xtest shape  : (19, 8)
    ytest shape  : (19,)
    xtrain shape : (76, 8)
    ytrain shape : (76,)
    
    Fold 1
    xtest shape  : (19, 8)
    ytest shape  : (19,)
    xtrain shape : (76, 8)
    ytrain shape : (76,)
    
    Fold 2
    xtest shape  : (19, 8)
    ytest shape  : (19,)
    xtrain shape : (76, 8)
    ytrain shape : (76,)
    
    Fold 3
    xtest shape  : (19, 8)
    ytest shape  : (19,)
    xtrain shape : (76, 8)
    ytrain shape : (76,)
    
    Fold 4
    xtest shape  : (19, 8)
    ytest shape  : (19,)
    xtrain shape : (76, 8)
    ytrain shape : (76,)