The following extremely simplified DataFrame represents a much larger DataFrame containing medical diagnoses:
import pandas as pd
medicalData = pd.DataFrame({'diagnosis': ['positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative']})
medicalData
diagnosis
0 positive
1 positive
2 negative
3 negative
4 positive
5 negative
6 negative
7 negative
8 negative
9 negative
Problem: For machine learning, I need to randomly split this dataframe into three subframes in the following way:
trainingDF, validationDF, testDF = SplitData(medicalData,fractions = [0.6,0.2,0.2])
...where the fractions list specifies the fraction of the complete data that goes into each subframe.
np.array_split
If you want to generalise to n splits, np.array_split is your friend (it works well with DataFrames).
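import numpy as np
# df is your DataFrame (medicalData in the question)
fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input (frac=1 returns every row, in random order)
df = df.sample(frac=1)
# split into 3 parts at the cumulative fractions: 60% and 80% of the rows
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))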
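The same shuffle-then-cut idea can be wrapped into the function the question asks for. This is only a sketch: the name split_data and the seed parameter are my own assumptions, and it slices with iloc instead of calling np.array_split. It also rounds the cut points rather than truncating them, so a floating-point result such as 7.999999 does not move a boundary by one row.

```python
import numpy as np
import pandas as pd


def split_data(df, fractions, seed=None):
    """Shuffle df and cut it into len(fractions) consecutive pieces.

    Hypothetical helper: `seed` is an assumed parameter for reproducibility.
    """
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")
    # frac=1 returns every row in random order.
    shuffled = df.sample(frac=1, random_state=seed)
    # Row positions where each piece ends, rounded to the nearest integer.
    cuts = np.round(fractions[:-1].cumsum() * len(shuffled)).astype(int)
    bounds = [0, *cuts, len(shuffled)]
    return [shuffled.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


medicalData = pd.DataFrame(
    {"diagnosis": ["positive"] * 3 + ["negative"] * 7})
train, val, test = split_data(medicalData, [0.6, 0.2, 0.2], seed=0)
print(len(train), len(val), len(test))
```

Note that this split is purely random, so on a small or imbalanced dataset one of the pieces may end up with very few (or no) positive rows.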
train_test_split
A more long-winded solution using train_test_split for stratified splitting, which keeps the positive/negative ratio the same in every subframe.
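from sklearn.model_selection import train_test_split
# separate the label column from the features
y = df.pop('diagnosis').to_frame()
X = df
# first split: 60% train, 40% held out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.4)
# second split: halve the held-out 40% into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, stratify=y_test, test_size=0.5)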
Where X is a DataFrame of your features, and y is a single-column DataFrame of your labels.
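To see what stratification buys you, here is a small self-contained run of the two-step split. The data is made up for illustration (100 rows, 30% positive, and a hypothetical age feature column), and random_state is only there to make the run repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 30% positive, one made-up feature column.
df = pd.DataFrame({
    "age": range(100),
    "diagnosis": ["positive"] * 30 + ["negative"] * 70,
})

X = df.drop(columns="diagnosis")
y = df[["diagnosis"]]  # single-column DataFrame of labels

# 60% train, 40% held out, preserving the class ratio.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, stratify=y, test_size=0.4, random_state=0)
# Halve the held-out 40% into validation and test (20% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.5, random_state=0)

# Print the size and class counts of each piece.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(part), part["diagnosis"].value_counts().to_dict())
```

Each piece should show roughly the same share of positives as the full dataset. Stratifying needs at least two rows of every class at each step, so this will not work on the ten-row toy frame from the question.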