So here is the thing. I am applying a binary classifier to data from 5 patients (P1, P2, ..., P5). Each patient has 100 samples of data, and the output is either 0 or 1.
So I put one patient aside (say P5) as testing data and used the remaining patients for training and validation. But I also want to find the optimal hyperparameters for the classifier (say an SVM), so I am using 4-fold cross-validation for that as well.
However, I want to make sure that I split the training data into cross_training and cross_testing such that all samples of one patient stay in the cross_testing fold. I don't want the data to be shuffled, because then I would have data from the same patient in both the testing and training folds, which is not good.
I am using GridSearchCV in Python for splitting the data, but I have no idea how to customize it so that we will have: 100 samples of P1 in the testing fold and all 300 samples of P2, P3, P4 in the training fold, and so on, up to 100 samples of P4 in the testing fold and all 300 samples of P1, P2, P3 in the training fold.
In other words, I want to create a patient indicator so that GridSearchCV splits the data according to it.
Is there a package for that, or should I try writing it manually without using GridSearchCV or anything of that nature?
You should use scikit-learn's GroupKFold. It should solve your problem easily. Use a list patients as groups, such that patients[i] == "p2" if sample i belongs to patient 2.
Here's the documentation.
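As a rough sketch of how this could fit together with GridSearchCV: the data below is synthetic (random features and labels standing in for the four training patients, with P5 assumed to be held out already), and the names X, y, the feature count, and the values in param_grid are placeholders of mine, not something from your setup. The idea is to pass a GroupKFold object as cv and hand the patients array to fit via groups.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the training patients P1..P4 (assumption: P5 is
# already set aside as the final test set and is not part of this data).
n_per_patient = 100
patients = np.repeat(["p1", "p2", "p3", "p4"], n_per_patient)
X = rng.normal(size=(len(patients), 5))      # 5 made-up features
y = rng.integers(0, 2, size=len(patients))   # made-up binary labels

# One fold per patient: with 4 groups and 4 splits, each test fold
# contains exactly one patient.
cv = GroupKFold(n_splits=4)

# Sanity check: no patient appears in both the training and test indices.
folds = list(cv.split(X, y, groups=patients))
for train_idx, test_idx in folds:
    assert set(patients[train_idx]).isdisjoint(patients[test_idx])

# Illustrative grid only; replace with the hyperparameters you care about.
param_grid = {"C": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)

# groups is forwarded to the splitter, so every candidate is scored
# on patient-wise folds.
search.fit(X, y, groups=patients)
print(search.best_params_)
```

With equal-sized patients, each of the four folds should hold 100 test samples from a single patient and 300 training samples from the other three, which matches the split described in the question.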