Tags: python, numpy, machine-learning, scikit-learn, auc

How to calculate roc auc score from positive unlabeled learning?


I'm trying to adapt some code for positive-unlabeled learning from this example. It runs with my data, but I also want to calculate the ROC AUC score, which is where I'm getting stuck.

My data is divided into positive samples (data_P) and unlabeled samples (data_U), each with only 2 features/columns of data such as:

# 3 example rows of data_P:
[[-1.471,  5.766],
 [-1.672,  5.121],
 [-1.371,  4.619]]

# 3 example rows of data_U:
[[ 1.23,   6.26  ],
 [-5.72,   4.1213],
 [-3.1,    7.129 ]]

I run the positive-unlabeled learning as in the linked example:

# Imports assumed from the linked example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

known_labels_ratio = 0.5

NP = data_P.shape[0]
NU = data_U.shape[0]

T = 1000
K = NP
train_label = np.zeros(shape=(NP+K,))
train_label[:NP] = 1.0
n_oob = np.zeros(shape=(NU,))
f_oob = np.zeros(shape=(NU, 2))
for i in range(T):
    # Bootstrap resample
    bootstrap_sample = np.random.choice(np.arange(NU), replace=True, size=K)
    # Positive set + bootstrapped unlabeled set
    data_bootstrap = np.concatenate((data_P, data_U[bootstrap_sample, :]), axis=0)
    # Train model
    model = DecisionTreeClassifier(max_depth=None, max_features=None,
                                   criterion='gini', class_weight='balanced')
    model.fit(data_bootstrap, train_label)
    # Index for the out of the bag (oob) samples
    idx_oob = sorted(set(range(NU)) - set(np.unique(bootstrap_sample)))
    # Transductive learning of oob samples
    f_oob[idx_oob] += model.predict_proba(data_U[idx_oob])
    n_oob[idx_oob] += 1
    
predict_proba = f_oob[:, 1]/n_oob

This all runs fine, but now I want to run roc_auc_score(), and I'm stuck on how to do so without errors.

Currently I am trying:

y_pred = model.predict_proba(data_bootstrap)
roc_auc_score(train_label, y_pred)
ValueError: bad input shape (3, 2)

The problem seems to be that y_pred gives an output with 2 columns, looking like:

y_pred
array([[0.00554287, 0.9944571 ],
       [0.0732314 , 0.9267686 ],
       [0.16861796, 0.83138204]])

I'm not sure why y_pred ends up like this. Is it giving the probability of each sample falling into one of 2 groups, essentially positive or other? Could I just filter these to select, per row, the probability with the highest score? Or is there a way to change this, or another way to calculate the ROC AUC score?


Solution

  • y_pred must be a single number per sample, giving the probability of the positive class p1; currently your y_pred consists of both probabilities [p0, p1] (with p0 + p1 = 1.0 by definition).

    Assuming that your positive class is class 1 (i.e. the second element of each array in y_pred), what you should do is:

    y_pred_pos = [y_pred[i, 1] for i in range(len(y_pred))]
    y_pred_pos # inspect
    # [0.9944571, 0.9267686, 0.83138204]
    
    roc_auc_score(train_label, y_pred_pos)
    

    In case your y_pred is a NumPy array (and not a Python list), you can replace the list comprehension above with:

    y_pred_pos = y_pred[:, 1]
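
    For completeness, here is a minimal end-to-end sketch using the variable names from the question (model, data_bootstrap, train_label); looking up the positive column via model.classes_ is an extra safeguard and assumes the positive class is encoded as 1.0, as in train_label above:

    from sklearn.metrics import roc_auc_score  # repeated here for self-containment
    import numpy as np

    # Probabilities for the samples the last model was fitted on;
    # columns of predict_proba follow the order of model.classes_
    y_pred = model.predict_proba(data_bootstrap)

    # Find the column that corresponds to the positive class (label 1.0)
    pos_col = int(np.where(model.classes_ == 1.0)[0][0])

    # roc_auc_score expects one score per sample: the positive-class probability
    print(roc_auc_score(train_label, y_pred[:, pos_col]))

    Note that this scores the classifier on the same data it was fitted on, so the resulting AUC is mainly a sanity check that the shapes line up rather than an unbiased performance estimate.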