Tags: scikit-learn, kernel-density

Kernel Density Estimation, Lists, and For Loops: Different Inputs, Same Output


I would like to create a Gaussian kernel density estimate for each of two samples of fantasy team scores over the first six weeks of the NFL season. To do this, I created a list of two KernelDensity objects and evaluated the log probability of each score from 0 to 300 under each KDE. At this point I have different log probabilities for each score. For some reason, however, when I exponentiate each log probability, I suddenly have identical values for the two KDEs.

A successful answer clearly identifies the mistake that somehow causes every probability to come from the second KDE, and provides a solution.

# Import Modules

import math
from sklearn.neighbors import KernelDensity
import numpy as np
X = np.array([[132,151,109,71,104,100],[123,182,102,123,108,82]]).transpose()

# Create a list to put two KernelDensity objects in 
kde = [[],[]]
for i in range(2):
    kde[i] = KernelDensity(kernel='gaussian', bandwidth=5).fit(X[:,i].reshape(-1,1))

# Create a list to place the log probabilities 
log_prob = [[],[]]
for i in range(2):
    X = np.arange(0,300,1)
    log_prob[i] = kde[i].score_samples(X.reshape(-1,1)).reshape(-1,1)

# (A mistake has been made in this section) Create a list for the probability of each score according to the two different KDEs

prob = [[0]*300]*2
for i in range(2):
    for j in range(300):
        prob[i][j] = math.exp(log_prob[i][j])

Solution

  • Yeah, it has to do with how you are constructing the lists. The specific mistake is the line prob = [[0]*300]*2: the outer *2 does not create two independent inner lists, it creates two references to the same inner list, so every assignment to prob[1][j] also overwrites prob[0][j] (and vice versa). I've run into the same problem in the past: when you build lists by index position this way, a change made to one row silently shows up in the other.

    The way around this is to use .copy() when constructing the outer list, or, as I did here, to build a separate inner list for each KDE and append it, instead of setting values by index position. A short demonstration of the aliasing follows, and then the corrected script.
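    Here is a quick, self-contained sketch (added for illustration, not from the original post) of why [[0]*300]*2 behaves this way:

    # Demonstration of the list-aliasing pitfall (illustrative only)
    rows = [[0] * 3] * 2           # the outer *2 copies the reference, not the list
    print(rows[0] is rows[1])      # True: both rows are the same object
    rows[1][0] = 99
    print(rows)                    # [[99, 0, 0], [99, 0, 0]] - both rows appear to change

    # Safe alternative: build an independent inner list for each row
    rows = [[0] * 3 for _ in range(2)]
    rows[1][0] = 99
    print(rows)                    # [[0, 0, 0], [99, 0, 0]] - only the second row changes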

    # Import Modules
    
    import math
    from sklearn.neighbors import KernelDensity
    import numpy as np
    X = np.array([[132,151,109,71,104,100],[123,182,102,123,108,82]]).transpose()
    
    # Create a list to put two KernelDensity objects in 
    kde = [[],[]]
    for i in range(2):
        kde[i] = KernelDensity(kernel='gaussian', bandwidth=5).fit(X[:,i].reshape(-1,1))
    
    # Create a list to place the log probabilities 
    log_prob = [[],[]]
    for i in range(2):
        X = np.arange(0,300,1)
        log_prob[i] = kde[i].score_samples(X.reshape(-1,1)).reshape(-1,1)
    
    # (Fixed) Create a list for the probability of each score according to the two different KDEs, building a separate inner list for each and appending it
    
    prob = []
    for i in range(2):
        prob_list_alpha = []
        for j in range(300):
            prob_list_alpha.append(math.exp(log_prob[i][j]))
            
        prob.append(prob_list_alpha)
    

    Output: (screenshot from the original answer, not reproduced here)
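    As a side note (not part of the original answer), score_samples already returns a NumPy array, so the exponentiation can be vectorized with np.exp instead of an explicit inner loop, which sidesteps the list-construction issue entirely. A minimal sketch, assuming the same data and bandwidth as above:

    # Vectorized alternative (sketch): exponentiate the log-densities directly
    import numpy as np
    from sklearn.neighbors import KernelDensity

    X = np.array([[132,151,109,71,104,100],[123,182,102,123,108,82]]).transpose()
    scores = np.arange(0, 300).reshape(-1, 1)   # grid of scores to evaluate

    prob = []
    for i in range(2):
        kde_i = KernelDensity(kernel='gaussian', bandwidth=5).fit(X[:, i].reshape(-1, 1))
        prob.append(np.exp(kde_i.score_samples(scores)))   # length-300 array of densities

    # prob[0] and prob[1] are independent NumPy arrays with different values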