
Look up BernoulliNB Probability in Dataframe


I have some training data (TRAIN) and some test data (TEST). Each row of each dataframe contains an observed class (X) and some binary feature columns (Y). BernoulliNB, fitted on the training data, predicts the probability of each class X given Y for the test data. I am trying to look up, for each row of the test data, the predicted probability of that row's observed class (Pr).

Edit: I used Antoine Zambelli's advice to fix the code:

import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()

# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})

# Test Data
TEST  = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                      'Y1': [1,1,0,1,0,1,0,0,0],
                      'Y2': [1,0,1,0,1,0,1,0,1],
                      'Y3': [1,1,0,1,1,0,0,0,0],
                      'Y4': [1,1,0,1,1,0,0,0,0]})

# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0

# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST ['X']
df_Tr_Y = TRAIN .drop('X', axis=1)
df_Te_Y = TEST  .drop('X', axis=1)

# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)

# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)

# Rename the columns after the classes of X
df_R.columns = BNB.classes_

df_S = df_R .join(TEST)

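# Look up the predicted probability of the observed X.
# X's that are not in the training data have no column in df_R, so they get NaN.
def get_lu(df):
  def lu(i, j):
    # i is the row label, j is the observed class (a column of df if seen in training)
    return df.get(j, {}).get(i, np.nan)
  return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S.index, df_S.X)]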

This seemed to work, giving me the result (df_S):

[Image: df_S with the added Pr column]

This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.


Solution

  • OK, there are a couple of issues here. I have a full working example below, but first the issues. The main one is the assertion that "This correctly gives a "NaN" for the first 2 rows".

    This ties back to how classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and is not included in the training data, the algorithm doesn't know that. It only looks at the feature data and tries to predict the label from it. So it can't return NaN (or 5, or anything else not in the training set). That NaN comes from your own lookup step going from df_R to df_S.
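
As a minimal sketch of that point (using a cut-down copy of the question's data, and the illustrative names `model` and `proba`, which are not in the original code), the probability table only ever has one column per class seen during training, and none of its entries are NaN:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Cut-down copy of the question's data: 'X' is the label, 'Y1'/'Y4' are features.
train = pd.DataFrame({'X':  [1, 2, 3, 9],
                      'Y1': [1, 1, 0, 0],
                      'Y4': [1, 0, 0, 0]})
test = pd.DataFrame({'X':  [5, 0, 1],   # 5 and 0 never appear in training
                     'Y1': [1, 1, 0],
                     'Y4': [1, 1, 0]})

model = BernoulliNB().fit(train[['Y1', 'Y4']], train['X'])
proba = pd.DataFrame(model.predict_proba(test[['Y1', 'Y4']]),
                     columns=model.classes_)

print(proba.columns.tolist())                # only the classes seen in training
print(proba.isna().any().any())              # whether any probability is NaN
print(np.allclose(proba.sum(axis=1), 1.0))   # whether each row sums to 1
```

Any NaN therefore has to come from a later lookup against a label, such as 5 or 0, that has no column in this table.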

    This leads to the second issue: the line df_Te_Y = TEST .iloc[ : , 1 : ] should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Labels are only given to the model during training; test labels, if you have them, are used only to evaluate the predictions. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
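
One way to make this less fragile, sketched here with a hypothetical `label_col` variable and a shortened copy of the test data, is to select the label by name rather than by position, so that a change in column order cannot leak the label into the features:

```python
import pandas as pd

# Shortened copy of the question's test data; 'X' is the label column there.
TEST = pd.DataFrame({'X':  [5, 0, 1],
                     'Y1': [1, 1, 0],
                     'Y2': [1, 0, 1]})

label_col = 'X'  # hypothetical helper name, not from the original code

features = TEST.drop(columns=[label_col])  # everything except the label
labels = TEST[label_col]                   # kept aside for evaluation only

print(features.columns.tolist())
```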

    Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.

    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import accuracy_score
    import pandas as pd
    
    BNB = BernoulliNB()
    
    # Training Data
    train_df = pd.DataFrame({'Y' : [1,2,3,9],
                             'X1': [1,1,0,0],
                             'X2': [0,0,0,0],
                             'X3': [0,0,0,0],
                             'X4': [1,0,0,0]})
    
    # Test Data
    test_df  = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                          'X1': [1,1,0,1,0,1,0,0,0],
                          'X2': [1,0,1,0,1,0,1,0,1],
                          'X3': [1,1,0,1,1,0,0,0,0],
                          'X4': [1,1,0,1,1,0,0,0,0]})
    
    
    X = train_df.drop('Y', axis=1)  # Known training data - all but 'Y' column.
    Y = train_df['Y']  # Known training labels - just the 'Y' column.
    
    X_te = test_df.drop('Y', axis=1)  # Test data.
    Y_te = test_df['Y']  # Only used to measure accuracy of prediction - if desired.
    
    Ar_R = BNB.fit(X, Y).predict_proba(X_te)  # Can be combined to a single line.
    df_R = pd.DataFrame(Ar_R)
    df_R.columns = BNB.classes_  # Rename as per class labels.
    
    # Columns are class labels and Rows are observations.
    # Each entry is a probability of that observation being assigned to that class label.
    print(df_R)
    
    predicted_labels = df_R.idxmax(axis=1).values  # For each row, take the column with the highest prob in that row.
    print(predicted_labels)  # [1 1 3 1 3 2 3 3 3]
    
    print(accuracy_score(Y_te, predicted_labels))  # Fraction of test rows predicted correctly (between 0 and 1).
    
    print(BNB.fit(X, Y).predict(X_te))  # [1 1 3 1 3 2 3 3 3], can be used in one line if the predicted labels are all we want.
    # NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
    # So probabilities have changed.
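
Coming back to the original lookup, here is one possible sketch of fetching, for each test row, the predicted probability of its observed label. It assumes the same `train_df`/`test_df` layout as the example above; `model`, `col_idx`, `rows` and `pr` are illustrative names. `pandas.Index.get_indexer` returns -1 for labels that never appeared in training, and those rows are set to NaN explicitly:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

train_df = pd.DataFrame({'Y':  [1, 2, 3, 9],
                         'X1': [1, 1, 0, 0],
                         'X2': [0, 0, 0, 0],
                         'X3': [0, 0, 0, 0],
                         'X4': [1, 0, 0, 0]})
test_df = pd.DataFrame({'Y':  [5, 0, 1, 1, 1, 2, 2, 2, 2],
                        'X1': [1, 1, 0, 1, 0, 1, 0, 0, 0],
                        'X2': [1, 0, 1, 0, 1, 0, 1, 0, 1],
                        'X3': [1, 1, 0, 1, 1, 0, 0, 0, 0],
                        'X4': [1, 1, 0, 1, 1, 0, 0, 0, 0]})

model = BernoulliNB().fit(train_df.drop('Y', axis=1), train_df['Y'])
proba = model.predict_proba(test_df.drop('Y', axis=1))  # shape: (rows, classes)

# Column position of each observed label in model.classes_; -1 if unseen in training.
col_idx = pd.Index(model.classes_).get_indexer(test_df['Y'])

# Pick proba[row, col_idx[row]] for every row, then blank out the unseen labels.
rows = np.arange(len(test_df))
pr = np.where(col_idx >= 0, proba[rows, col_idx], np.nan)

test_df['Pr'] = pr
print(test_df)
```

The observed label is used here only after prediction, to index into the output; it never reaches `fit` or `predict_proba` as a feature.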
    

    I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.