
How to Extract Row-Specific Info from Sklearn Logistic Regression Predictions


I have a logistic regression model that predicts which customers are going to churn. I can't find code anywhere that extracts the accounts that are predicted to churn. Account Name is a string column, so I'm not feeding it into the model, but I need to map the predicted churn rows back to the original table.

This is what my data looks like; however, I can't replicate the issue on a smaller sample:

import random
import pandas as pd
from sklearn.linear_model import LogisticRegression

random_data = [['ABC', 'yes'], ['AAA', 'yes'],
    ['BBB', 'no'], ['XTZ', 'no'], ['ADB', 'no']]
df = pd.DataFrame(random_data, columns=['Account', 'Target'])
df['height'] = random.sample(range(10), len(df))  # range replaces Python 2's xrange
df['weight'] = random.sample(range(10), len(df))
# Drop the identifier and the label; only the numeric features go into the model
X_train_pd = df.drop(['Account', 'Target'], axis=1)
y_train_pd = df['Target']


logreg = LogisticRegression()
logreg.fit(X_train_pd, y_train_pd)
y_pred_train = logreg.predict(X_train_pd)

Here is what I've tried. It's hacky, and the bug is shown below under "Extract Account Names Predicted to Churn":

y_pred_prob_df = pd.DataFrame(logreg.predict_proba(X_test))

data = np.array([y_test_pd, y_pred_test ])
data_y = pd.DataFrame({'y_test':data[0],'y_pred_test':data[1]} )

ID = test[['Account Name', 'Status']]

Accounts=pd.concat([ID, data_y, y_pred_prob_df], axis=1) 

Here is the bug: when I concat the actual y, the predicted y, the probabilities, and the original dataset (ID), I get a few extra rows. If I take out ID, the bug goes away.

print(ID.shape)              # (250, 2)
print(data_y.shape)          # (250, 2)
print(y_pred_prob_df.shape)  # (250, 2)
print(Accounts.shape)        # (267, 6) <-- BUG

s = pd.concat([data_y, y_pred_prob_df], axis=1)
print(s.shape)               # (250, 4) <-- Resolves BUG: ID is the issue

The hacky way isn't working. We want to extract ONLY the accounts that are predicted to churn.

The result I'm looking for is one data frame with all my features, the target, the predicted-churn flag, and the probability of the prediction. Specifically: is Account Name 'ABC' predicted to churn, what is the probability of that prediction, and what are all the fields that went into the model?

It seems like I can't use loc to find only the accounts predicted to churn.


Solution

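  • To get the accounts that are predicted to churn, use the predictions as a boolean mask. This works because y_pred_train has one entry per row of df, in the same order:

    df.loc[y_pred_train == "yes"]

    To get the probabilities, build the probability frame from the same rows the predictions came from (here the training features, not X_test), then apply the same mask:

    y_pred_prob_df = pd.DataFrame(logreg.predict_proba(X_train_pd), columns=logreg.classes_)
    y_pred_prob_df.loc[y_pred_train == "yes"]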
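
The single data frame the question asks for can probably be assembled by writing the predictions straight onto a copy of the original table, so every column shares one index. The sketch below is only an illustration under assumptions: the feature values, the extra 'QRS' account, the 10..15 index (standing in for the index a train/test split leaves behind), and the names result, churners, misaligned, and aligned are all made up here. The last lines show one plausible cause of the extra rows in the question: pd.concat(axis=1) aligns on index labels, so a frame with a fresh 0..n-1 index does not line up with one that kept its original labels.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data modelled on the question; all values are invented for illustration.
df = pd.DataFrame({
    "Account": ["ABC", "AAA", "BBB", "XTZ", "ADB", "QRS"],
    "Target":  ["yes", "yes", "no", "no", "no", "yes"],
    "height":  [9, 8, 2, 1, 3, 7],
    "weight":  [8, 9, 1, 2, 2, 9],
})
# Assumption: mimic the non-contiguous index left behind by a train/test split.
df.index = [10, 11, 12, 13, 14, 15]

features = ["height", "weight"]
logreg = LogisticRegression().fit(df[features], df["Target"])

# Assign predictions as plain arrays onto a copy of the original frame.
# Arrays are assigned by position, so no index alignment can go wrong.
result = df.copy()
result["predicted"] = logreg.predict(df[features])
# predict_proba columns follow the order of logreg.classes_.
yes_col = list(logreg.classes_).index("yes")
result["prob_yes"] = logreg.predict_proba(df[features])[:, yes_col]

# Only the accounts predicted to churn, with features and probability.
churners = result.loc[result["predicted"] == "yes"]

# A fresh DataFrame gets a 0..n-1 index; concat(axis=1) takes the union of
# both indexes, which adds rows filled with NaN.
proba_df = pd.DataFrame(logreg.predict_proba(df[features]))
misaligned = pd.concat([df[["Account"]], proba_df], axis=1)
# Resetting the index on the ID frame makes the labels match again.
aligned = pd.concat([df[["Account"]].reset_index(drop=True), proba_df], axis=1)

print(churners)
print(misaligned.shape, aligned.shape)
```

If the extra rows in the original concat do come from mismatched indexes, calling reset_index(drop=True) on ID before concatenating, or passing index=test.index when building the prediction frames, should bring the row count back to 250.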