Tags: python, pandas, dataframe, machine-learning, sampling

Stratified splitting of pandas dataframe into training, validation and test set


The following extremely simplified DataFrame represents a much larger DataFrame containing medical diagnoses:

import pandas as pd

medicalData = pd.DataFrame({'diagnosis': ['positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative']})
medicalData

    diagnosis
0   positive
1   positive
2   negative
3   negative
4   positive
5   negative
6   negative
7   negative
8   negative
9   negative

Problem: For machine learning, I need to randomly split this dataframe into three subframes in the following way:

trainingDF, validationDF, testDF = SplitData(medicalData,fractions = [0.6,0.2,0.2])

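...where the fractions list specifies the fraction of the complete data that goes into each subframe. Because the split should be stratified, each subframe should keep roughly the same proportion of positive and negative diagnoses as the full data.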


Solution

  • np.array_split

    If you want to generalise to n splits, np.array_split is your friend (it works with DataFrames as well as arrays). Note that this gives a plain random split, not a stratified one.

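    import numpy as np

    fractions = np.array([0.6, 0.2, 0.2])
    # shuffle the rows (df is your DataFrame, e.g. medicalData)
    df = df.sample(frac=1)
    # cut the shuffled frame at the cumulative fractions (60% and 80% of the rows)
    train, val, test = np.array_split(
        df, (fractions[:-1].cumsum() * len(df)).astype(int))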
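As a rough sketch of the shuffle-and-cut idea above, the helper below computes the cut points and slices the shuffled frame by position. The name split_by_fractions and the seed parameter are hypothetical and not part of NumPy or pandas. It also rounds the cut points instead of truncating them, on the assumption that you would rather not lose a row to floating-point error in the cumulative sum.

```python
import numpy as np
import pandas as pd


def split_by_fractions(df, fractions, seed=0):
    """Randomly split df into len(fractions) parts (not stratified)."""
    fractions = np.asarray(fractions, dtype=float)
    # Shuffle the rows reproducibly.
    shuffled = df.sample(frac=1, random_state=seed)
    # Cumulative fractions give the row positions at which to cut.
    cuts = np.round(fractions[:-1].cumsum() * len(shuffled)).astype(int)
    bounds = [0, *cuts, len(shuffled)]
    # Slice by position between consecutive boundaries.
    return [shuffled.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


medicalData = pd.DataFrame({'diagnosis': ['positive', 'positive', 'negative',
                                          'negative', 'positive', 'negative',
                                          'negative', 'negative', 'negative',
                                          'negative']})
parts = split_by_fractions(medicalData, [0.6, 0.2, 0.2])
print([len(p) for p in parts])
```

Every row ends up in exactly one part, but nothing here controls how many positive diagnoses land in each part.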
    

  • train_test_split

    A more long-winded solution that uses scikit-learn's train_test_split twice to get a stratified split.

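    from sklearn.model_selection import train_test_split

    # separate the labels from the features
    y = df.pop('diagnosis').to_frame()
    X = df

    # first split: 60% training, 40% held out
    X_train, X_test, y_train, y_test = train_test_split(
            X, y, stratify=y, test_size=0.4)

    # second split: divide the held-out 40% evenly into test and validation
    X_test, X_val, y_test, y_val = train_test_split(
            X_test, y_test, stratify=y_test, test_size=0.5)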
    

    Here, X is a DataFrame of your features, and y is a single-column DataFrame of your labels.
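The two-step stratified split can also be wrapped in a single function shaped like the SplitData call from the question. This is only a sketch: split_data, label_col and seed are hypothetical names, and it assumes exactly three fractions and a label column that lives inside the DataFrame. Row counts are passed to train_test_split as integers so that the part sizes do not depend on how a float test_size gets rounded. The demo frame is made up (30 positive, 70 negative) because stratifying needs at least two rows of every class at each step, which the 10-row example cannot guarantee.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, fractions=(0.6, 0.2, 0.2), label_col='diagnosis', seed=0):
    """Stratified train/validation/test split (hypothetical helper)."""
    n = len(df)
    n_val = int(round(fractions[1] * n))
    n_test = int(round(fractions[2] * n))
    # First split: training rows versus everything held out.
    train, rest = train_test_split(
        df, test_size=n_val + n_test,
        stratify=df[label_col], random_state=seed)
    # Second split: divide the held-out rows into validation and test.
    val, test = train_test_split(
        rest, test_size=n_test,
        stratify=rest[label_col], random_state=seed)
    return train, val, test


# Made-up demo data, larger than the 10-row example in the question.
demo = pd.DataFrame({'diagnosis': ['positive'] * 30 + ['negative'] * 70})
train_df, val_df, test_df = split_data(demo, fractions=[0.6, 0.2, 0.2])
for name, part in [('train', train_df), ('val', val_df), ('test', test_df)]:
    share = (part['diagnosis'] == 'positive').mean()
    print(name, len(part), round(share, 2))
```

Splitting the whole DataFrame rather than separate X and y keeps features and labels together, so each returned subframe can be divided into features and labels afterwards.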