python · machine-learning · scikit-learn · kernel-density

How to normalize Kernel Density Estimation using scikit?


I am using KDE for multi-class classification, implemented with scikit-learn. As described on the website, the KDE at a point x is defined as a sum of kernel contributions over the training points.

Should I normalize the result when comparing the kernel density estimates of different classes?

Link for KDE:
http://scikit-learn.org/stable/modules/density.html#kernel-density-estimation
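
For reference, this is roughly what my setup looks like (a minimal sketch; the data, bandwidth and class labels below are made up for illustration):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X_class_a = rng.normal(loc=0.0, scale=1.0, size=(100, 1))   # samples of class A
    X_class_b = rng.normal(loc=3.0, scale=1.0, size=(80, 1))    # samples of class B

    kde_a = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_class_a)
    kde_b = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_class_b)

    # score_samples returns the log-density; I pick the class with the larger value
    x_new = np.array([[1.2]])
    scores = [kde_a.score_samples(x_new)[0], kde_b.score_samples(x_new)[0]]
    predicted = ['A', 'B'][int(np.argmax(scores))]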


Solution

  • The equality does not hold as written; this is simply a poor documentation example. You can see in the code that the estimate is normalized, e.g. here

    log_density -= np.log(N)
    return log_density
    

    so you clearly divide by N.
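
    A quick way to see why that division matters for the question above (a small sketch with made-up data and an arbitrary bandwidth): fit the estimator on samples of very different sizes drawn from the same distribution, and the returned log-densities stay on the same scale, because the kernel sum is averaged rather than left growing with N.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    small = rng.normal(size=(50, 1))     # 50 points
    large = rng.normal(size=(5000, 1))   # 100x more points, same distribution

    x = np.array([[0.0]])
    for data in (small, large):
        kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(data)
        print(kde.score_samples(x))
    # both printed values are comparable; without the 1/N the second sum would be
    # roughly 100x larger, i.e. about log(100) ~ 4.6 higher in log-density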

    The correct formula, from a mathematical perspective, is actually either

    1/N SUM_i K(x_i - x)
    

    or

    1/(hN) SUM_i K((x_i - x)/h)
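
    To convince yourself that this second form is what score_samples returns, here is a small check (a sketch on made-up 1-D data; for the Gaussian kernel, K is the standard normal pdf):

    import numpy as np
    from scipy.stats import norm
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 1))
    h = 0.4
    x = np.linspace(-3, 3, 7)

    # 1/(hN) SUM_i K((x_i - x)/h), with K the standard normal pdf
    manual = np.array([norm.pdf((X[:, 0] - xi) / h).sum() / (h * len(X)) for xi in x])

    sk = np.exp(KernelDensity(kernel='gaussian', bandwidth=h)
                .fit(X).score_samples(x[:, None]))
    print(np.allclose(manual, sk))   # True: the two computations agree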
    

    You can also dive deeper into the .c code that actually computes the kernels, and you will find that they are internally normalized:

     case __pyx_e_7sklearn_9neighbors_9ball_tree_GAUSSIAN_KERNEL:
    
     /* "binary_tree.pxi":475
     *     cdef ITYPE_t k
     *     if kernel == GAUSSIAN_KERNEL:
     *         factor = 0.5 * d * LOG_2PI             # <<<<<<<<<<<<<<
     *     elif kernel == TOPHAT_KERNEL:
     *         factor = logVn(d)
     */
        __pyx_v_factor = ((0.5 * __pyx_v_d) * __pyx_v_7sklearn_9neighbors_9ball_tree_LOG_2PI);
        break;
    

    Thus each kernel K actually integrates to 1, and consequently you just take the average to get a valid density for the whole KDE, which is exactly what happens internally.
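
    So, coming back to the question: the values score_samples returns for different classes are already on a common, properly normalized density scale. A quick numerical check makes that visible (a sketch on made-up 1-D data, integrating exp(score_samples) over a grid):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.normal(size=(300, 1))

    kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
    grid = np.linspace(-8, 8, 2000)[:, None]
    dens = np.exp(kde.score_samples(grid))
    print(np.trapz(dens, grid[:, 0]))   # ~1.0: already a valid density, nothing more to normalize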