python imblearn smote

How can I extract the newly added rows after SMOTE (imblearn module)?


Is it possible to extract the rows that imblearn's SMOTE added to a pandas DataFrame?


Solution

  • I think I figured it out. Apparently the new rows are appended at the end of the DataFrame returned by fit_resample.

    My target column is "DIED":

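    import numpy as np
    import pandas as pd
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTENC

    # Columns 10 and 11 are the categorical features
    smotez = SMOTENC([10, 11], random_state=555, k_neighbors=10)
    smote_tomek = SMOTETomek(random_state=555, smote=smotez, n_jobs=-1)
    X_train_new, y_train_new = smote_tomek.fit_resample(X_train, y_train)
    train_data_new = pd.concat([X_train_new, y_train_new], axis=1)
    # Rows past the original length are the ones appended by the resampler
    smote_data = train_data_new.iloc[len(X_train):]
    counts = np.column_stack(np.unique(smote_data['DIED'], return_counts=True))
    print("Y_train_smote:\n", counts, smote_data['DIED'].mean())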
    

    As you can see, all of the extracted rows belong to the minority class (DIED = 1):

    Y_train_smote: [[ 1 91936]] 1.0

    Double-checking, the expression below should return 0:

    print(len(smote_data) + len(X_train) - len(X_train_new))
    

    0
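
One caveat on the approach above: SMOTETomek also runs a Tomek-links cleaning step after oversampling, and that step can drop rows, including original ones. When that happens, slicing by position from len(X_train) may miss some synthetic rows. A minimal sketch of an alternative is an anti-join that keeps every resampled row with no exact match in the original data. The helper name rows_not_in_original and the two small frames are made up for illustration; the sketch assumes that untouched original rows keep exactly the same values and dtypes in the resampled output.

```python
import pandas as pd


def rows_not_in_original(original: pd.DataFrame, resampled: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `resampled` that have no exact match in `original`.

    Hypothetical helper, not part of imblearn. Both frames must share the
    same columns (features plus target).
    """
    merged = resampled.merge(
        original.drop_duplicates(),  # avoid duplicating rows on the left side
        how="left",
        on=list(resampled.columns),
        indicator=True,              # adds a "_merge" column
    )
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Made-up original training data (one feature plus the target).
original = pd.DataFrame({
    "age": [61.0, 47.0, 70.0, 55.0],
    "DIED": [0, 0, 1, 0],
})

# Made-up stand-in for a fit_resample result: the row with age 47.0 was
# removed by cleaning, and two synthetic minority rows were appended.
resampled = pd.DataFrame({
    "age": [61.0, 70.0, 55.0, 68.5, 66.25],
    "DIED": [0, 1, 0, 1, 1],
})

synthetic = rows_not_in_original(original, resampled)
print(synthetic)

# The position-based slice finds only one of the two synthetic rows here,
# because one original row was dropped before them.
tail = resampled.iloc[len(original):]
print(len(tail))
```

With plain SMOTE and no cleaning step, the two approaches should agree, and the positional slice is cheaper. The anti-join is mainly useful when a combined sampler may have removed original rows.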