I have been given a dataset that was already split into training and validation (test) data. I need to split the training data further into a proper training set and a calibration set, without touching my current validation (test) set. I don't have access to the original dataset.
I would like to do this randomly, so that every time I run my script I get a different training and calibration set. I am aware of the .sample() function, but my training dataset has 44000 rows.
training = dataset.loc[dataset['split']== 'train']
print("Training Created")
#print(training.head())
validation = dataset.loc[dataset['split']== 'valid']
print("Validation Created")
#print(validation.head())
What I need is something like this:
# proper training set
x_train = breast_cancer.values[:-100, :-1]
y_train = breast_cancer.values[:-100, -1]
# calibration set
x_cal = breast_cancer.values[-100:-1, :-1]
y_cal = breast_cancer.values[-100:-1, -1]
# (x_k+1, y_k+1)
x_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]
I am unsure how to do the second split.
Object  | Variable | Split
Cancer1 | 55       | Train
Cancer5 | 45       | Train
Cancer2 | 56       | Valid
Cancer3 | 68       | Valid
Cancer4 | 75       | Valid
It seems that you already have a column with the train and validation sets assigned. The usual way is to use sklearn.model_selection.train_test_split. It shuffles the rows by default, so unless you fix random_state you get a different split on every run. So to further split your training data into training and "calibration" sets, just use it on the train set (note that you have to split into X and y first):
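from sklearn.model_selection import train_test_split

# initial split into train/test, using the existing column
# (adjust the column name and labels to match your data)
train = df.loc[df['Split']== 'train']
test = df.loc[df['Split']== 'valid']
# split the test set into features and target
# (.iloc selects by position; this assumes the target is the last column)
x_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]
# same with the train set
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]
# split the train set into proper training and calibration sets
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train)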
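If you want to control how many rows end up in the calibration set, train_test_split takes a test_size argument (a fraction, or an absolute number of rows). Here is a minimal sketch on a made-up frame. The column names feature and target, the 10 rows, and the choice of 2 calibration rows are assumptions for illustration only, not your data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: one feature, a target,
# and a pre-assigned 'split' column (7 train rows, 3 valid rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=10),
    "target": rng.integers(0, 2, size=10),
    "split": ["train"] * 7 + ["valid"] * 3,
})

# Only the rows already marked as training data are touched.
train = df.loc[df["split"] == "train"]

# Select columns by name so the 'split' label never ends up in X or y.
X = train[["feature"]]
y = train["target"]

# shuffle=True is the default, and random_state is left unset,
# so the rows chosen for calibration differ from run to run.
# test_size=2 holds out exactly two rows for calibration.
X_train, X_calib, y_train, y_calib = train_test_split(X, y, test_size=2)

print(len(X_train), len(X_calib))
```

With 44000 training rows you would more likely pass a fraction such as test_size=0.2. Setting random_state would make the split reproducible, which is the opposite of what you asked for, so it is left out here.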