python numpy machine-learning naivebayes

Using Naive Bayes to calculate X given Y


I am learning the Naïve Bayes Classifier.

I have a matrix of vectors. Some vectors have the class label 1 (boy); the others have the class label -1 (girl).

There are 128 features in each vector. Each feature is either 0 or 1. I need to determine the probability of each feature being 1 given that y is either -1 or +1.

I've been provided the starter code below. The general guidance during the course is to avoid loops.
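As a rough illustration of what "avoid loops" tends to mean in NumPy, here is a minimal sketch that uses a boolean mask and a column-wise sum. The toy arrays and the names `mask`, `X_pos`, `ones_per_feature`, and `frac_ones` are made up for this example and are not part of the course material. It also leaves out the smoothing step from the starter code.

```python
import numpy as np

# Toy data (hypothetical): 4 examples with 3 binary features each.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
Y = np.array([1, 1, -1, -1])

# A boolean mask selects all rows of one class at once; no Python loop needed.
mask = (Y == 1)
X_pos = X[mask]

# Summing along axis 0 counts, per feature, how many selected rows contain a 1.
ones_per_feature = X_pos.sum(axis=0)

# Dividing by the number of selected rows gives the per-feature fraction of 1s.
frac_ones = ones_per_feature / X_pos.shape[0]
```

The same pattern with `Y == -1` covers the other class.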

I've been through all of the provided material numerous times and just can't get how to do this. I'm not asking for the answer, just some guidance to get me started.

import numpy as np

def naivebayesPXY(X,Y):
    """
    naivebayesPXY(X, Y) returns [posprob,negprob]
    
    Input:
        X : n input vectors of d dimensions (nxd)
        Y : n labels (-1 or +1) (n)
    
    Output:
        posprob: probability vector of p(x_alpha = 1|y=1)  (d)
        negprob: probability vector of p(x_alpha = 1|y=-1) (d)
    """
    
    # add one positive and negative example to avoid division by zero ("plus-one smoothing")
    n, d = X.shape
    X = np.concatenate([X, np.ones((2,d)), np.zeros((2,d))])
    Y = np.concatenate([Y, [-1,1,-1,1]])
    

    
    return posprob,negprob

I think I have to sum all the features in the Boy vectors and then divide by ... something. And sum all the features in the Girl vectors and then divide by ... something.
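One way to write down the "something", assuming the features are 0/1 and taking into account the smoothing rows the starter code appends (one all-ones and one all-zeros vector per class): the symbols $c$ for the class and $n_c$ for the number of original training vectors with label $c$ are notation introduced here, not taken from the course.

```latex
\[
\hat{p}(x_\alpha = 1 \mid y = c)
  = \frac{1 + \sum_{i \,:\, y_i = c} x_{i\alpha}}{2 + n_c},
  \qquad c \in \{-1, +1\}
\]
```

The numerator counts the vectors of class $c$ whose feature $\alpha$ equals 1 (plus the added all-ones vector), and the denominator counts all vectors of class $c$ (plus the two added vectors).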

But everything I try related to this fails the autograding tests.


Solution

  • Thanks for the help. Davide's comment helped me recognize that I didn't understand the task perfectly. I was able to figure out a working solution.

    import numpy as np

    def naivebayesPXY(X, Y):
        """
        naivebayesPXY(X, Y) returns [posprob, negprob]

        Input:
            X : n input vectors of d dimensions (nxd)
            Y : n labels (-1 or +1) (n)

        Output:
            posprob: probability vector of p(x_alpha = 1|y=1)  (d)
            negprob: probability vector of p(x_alpha = 1|y=-1) (d)
        """

        # The next three lines were provided by the challenge. They add one
        # all-ones and one all-zeros example to each class ("plus-one smoothing").
        n, d = X.shape
        X = np.concatenate([X, np.ones((2, d)), np.zeros((2, d))])
        Y = np.concatenate([Y, [-1, 1, -1, 1]])

        # Separate the data by class (boys: y = 1, girls: y = -1)
        X_boys = X[Y == 1]
        X_girls = X[Y == -1]

        # Per feature, count the boy examples in which the feature is 1
        boys_ones = X_boys.sum(axis=0)
        # Per feature, count the boy examples in which the feature is 0
        boys_zeroes = X_boys.shape[0] - boys_ones
        # Divide the count of 1s by the total number of boy examples
        pos_prob = boys_ones / (boys_ones + boys_zeroes)

        # Same process as above, for the girls class
        girls_ones = X_girls.sum(axis=0)
        girls_zeroes = X_girls.shape[0] - girls_ones
        neg_prob = girls_ones / (girls_ones + girls_zeroes)

        return pos_prob, neg_prob
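As a quick sanity check, here is a sketch of a more compact variant together with a tiny usage example. The name `naivebayesPXY_compact` and the toy data are hypothetical. The variant relies on the fact that the count of ones plus the count of zeroes is just the number of rows in the class, so the ratio is the column-wise mean.

```python
import numpy as np

def naivebayesPXY_compact(X, Y):
    """Hypothetical compact variant: per-class column means after smoothing."""
    n, d = X.shape
    # Same smoothing as in the starter code: one all-ones and one all-zeros
    # example is appended to each class.
    X = np.concatenate([X, np.ones((2, d)), np.zeros((2, d))])
    Y = np.concatenate([Y, [-1, 1, -1, 1]])

    # For 0/1 features, the column mean within a class is the fraction of 1s.
    posprob = X[Y == 1].mean(axis=0)
    negprob = X[Y == -1].mean(axis=0)
    return posprob, negprob

# Toy data (made up for illustration): 4 examples, 3 binary features.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
Y = np.array([1, 1, -1, -1])

posprob, negprob = naivebayesPXY_compact(X, Y)
# posprob is [0.75, 0.5, 0.5]; negprob is [0.25, 0.5, 0.75]
```

With two original examples per class plus the two smoothing rows, every probability here is a count out of 4, which makes the result easy to verify by hand.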