Tags: python, numpy, machine-learning, scikit-learn, auc

How to calculate roc auc score from positive unlabeled learning?


I'm trying to adapt some code for positive-unlabeled learning from this example. It runs with my data, but I also want to calculate the ROC AUC score, which is where I'm getting stuck.

My data is divided into positive samples (data_P) and unlabeled samples (data_U), each with only 2 features/columns of data such as:

# 3 example rows of data_P:
[[-1.471,  5.766],
 [-1.672,  5.121],
 [-1.371,  4.619]]

# 3 example rows of data_U:
[[ 1.23,   6.26  ],
 [-5.72,   4.1213],
 [-3.1,    7.129 ]]

I run the positive-unlabeled learning as in the linked example:

# Imports assumed from the linked example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

known_labels_ratio = 0.5

NP = data_P.shape[0]
NU = data_U.shape[0]

T = 1000
K = NP
train_label = np.zeros(shape=(NP+K,))
train_label[:NP] = 1.0
n_oob = np.zeros(shape=(NU,))
f_oob = np.zeros(shape=(NU, 2))
for i in range(T):
    # Bootstrap resample
    bootstrap_sample = np.random.choice(np.arange(NU), replace=True, size=K)
    # Positive set + bootstrapped unlabeled set
    data_bootstrap = np.concatenate((data_P, data_U[bootstrap_sample, :]), axis=0)
    # Train model
    model = DecisionTreeClassifier(max_depth=None, max_features=None,
                                   criterion='gini', class_weight='balanced')
    model.fit(data_bootstrap, train_label)
    # Index for the out of the bag (oob) samples
    idx_oob = sorted(set(range(NU)) - set(np.unique(bootstrap_sample)))
    # Transductive learning of oob samples
    f_oob[idx_oob] += model.predict_proba(data_U[idx_oob])
    n_oob[idx_oob] += 1
    
predict_proba = f_oob[:, 1]/n_oob

This all runs fine, but now I want to run roc_auc_score(), and I'm stuck on how to do so without errors.

Currently I am trying:

y_pred = model.predict_proba(data_bootstrap)
roc_auc_score(train_label, y_pred)
ValueError: bad input shape (3, 2)

The problem seems to be that y_pred gives an output with 2 columns, looking like:

y_pred
array([[0.00554287, 0.9944571 ],
       [0.0732314 , 0.9267686 ],
       [0.16861796, 0.83138204]])

I'm not sure why y_pred ends up like this. Is it giving the probability of each sample falling into one of 2 groups, essentially positive or other? Could I just filter these to select, per row, the probability with the highest score? Or is there a way to change this, or another way to calculate the ROC AUC score?


Solution

  • y_pred must be a single number per sample, giving the probability of the positive class p1; currently your y_pred consists of both probabilities [p0, p1] (with p0 + p1 = 1.0 by definition).

    Assuming that your positive class is class 1 (i.e. the second element of each array in y_pred), what you should do is:

    y_pred_pos = [y_pred[i, 1] for i in range(len(y_pred))]
    y_pred_pos # inspect
    # [0.9944571, 0.9267686, 0.83138204]
    
    roc_auc_score(train_label, y_pred_pos)
    

    In case your y_pred is a NumPy array (and not a Python list), you can replace the list comprehension above with:

    y_pred_pos = y_pred[:, 1]
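
    For completeness, here is a minimal end-to-end sketch using the variable names from the question (model, data_bootstrap, train_label); looking up the positive column via model.classes_ is an extra safeguard and assumes the positive class is encoded as 1.0, as in train_label above:

    from sklearn.metrics import roc_auc_score  # repeated here for self-containment
    import numpy as np

    # Probabilities for the samples the last model was fitted on;
    # columns of predict_proba follow the order of model.classes_
    y_pred = model.predict_proba(data_bootstrap)

    # Find the column that corresponds to the positive class (label 1.0)
    pos_col = int(np.where(model.classes_ == 1.0)[0][0])

    # roc_auc_score expects one score per sample: the positive-class probability
    print(roc_auc_score(train_label, y_pred[:, pos_col]))

    Note that this scores the classifier on the same data it was fitted on, so the resulting AUC is mainly a sanity check that the shapes line up rather than an unbiased performance estimate.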