python pandas machine-learning scikit-learn training-data

Select a random subset of data


I have a dataset given to me that was previously split into training and validation (test) data. I need to further split the training data into a separate training set and a calibration set. I don't want to touch my current validation (test) set, and I don't have access to the original dataset.

I would like to do this randomly, so that every time I run my script I get a different training and calibration set. I am aware of the .sample() function, but my training dataset has 44,000 rows.

Original Datasets

training = dataset.loc[dataset['split']== 'train']
print("Training Created")
#print(training.head())

validation = dataset.loc[dataset['split']== 'valid']
print("Validation Created")
#print(validation.head())

Where I would need something like this:

# proper training set
x_train = breast_cancer.values[:-100, :-1]
y_train = breast_cancer.values[:-100, -1]
# calibration set
x_cal = breast_cancer.values[-100:-1, :-1]
y_cal = breast_cancer.values[-100:-1, -1]
# (x_k+1, y_k+1)
x_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]

I am unsure what to do for the second split.

Example of Dataset

Object  | Variable | Split
Cancer1     55     Train
Cancer5     45     Train
Cancer2     56     Valid
Cancer3     68     Valid
Cancer4     75     Valid

Solution

  • It seems you already have a column that assigns each row to the training or validation set. The usual way to split data is sklearn.model_selection.train_test_split. So to further split your training data into training and "calibration" sets, just apply it to the training set (note that you first have to separate the features X from the target y):

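    from sklearn.model_selection import train_test_split

    # initial split into train/test
    # (match the column name and label casing used in your data)
    train = df.loc[df['Split'] == 'train']
    test = df.loc[df['Split'] == 'valid']

    # split the test set into features and target
    # (assumes the target is the last column; positional slicing needs .iloc, not .loc)
    X_test = test.iloc[:, :-1]
    y_test = test.iloc[:, -1]

    # same with the train set
    X_train = train.iloc[:, :-1]
    y_train = train.iloc[:, -1]

    # split into train and calibration sets
    # (no random_state is set, so the split differs on every run)
    X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train)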
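
For a runnable end-to-end version, here is a minimal sketch on made-up data. The column names `feature`, `target` and `split`, the 800/200 row counts and the 20% calibration fraction are all assumptions for illustration, not taken from the question, so adjust them to your dataset. It drops the split column before training so it does not end up among the features, and it leaves the validation rows untouched.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: one feature, a binary target,
# and a 'split' column marking the pre-assigned train/valid rows.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": rng.integers(0, 2, size=1000),
    "split": ["train"] * 800 + ["valid"] * 200,
})

training = dataset.loc[dataset["split"] == "train"]
validation = dataset.loc[dataset["split"] == "valid"]  # never touched below

# Separate features and target; drop the bookkeeping column as well.
X = training.drop(columns=["target", "split"])
y = training["target"]

# random_state=None (the default) reshuffles on every run;
# pass an integer instead if you need a reproducible split.
# stratify=y keeps the class proportions similar in both parts.
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.2, random_state=None, stratify=y
)

print(len(X_train), len(X_cal), len(validation))
```

If you would rather stay in pandas, something like `cal = training.sample(frac=0.2)` followed by `train = training.drop(cal.index)` should give a similar random split, provided the index values are unique. 44,000 rows is not a problem for either approach.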