c++, kernel-density

How to get a function from a kernel density estimation?


I have tried implementing a KDE, but I'm getting a bit stuck on the math.

I'll give some pseudocode of my current way of performing a KDE to better show my problem.

(I replicated the Wikipedia example: https://en.wikipedia.org/wiki/Kernel_density_estimation)

  1. Define the interval of the KDE and how many points are used (I used the interval [-6, 11] with 1000 "points" between -6.0 and 11.0).

  2. Iterate over all the points and give every point a probability by summing the kernels from the given data points. Now every point between -6 and 11 has a probability of being chosen.

  3. Make sure the probabilities of all points add up to 100% and draw samples according to those probabilities.
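The three steps above can be sketched in C++ roughly as follows. This is only an illustration, assuming a Gaussian kernel and a fixed bandwidth `h`; the names `grid_kernel`, `GridSampler`, and `make_grid_sampler` are hypothetical and not part of the original post. `std::discrete_distribution` normalizes the weights itself, which covers step 3.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Standard normal density, used here as the kernel (an assumption).
double grid_kernel(double u) {
    const double inv_sqrt_2pi = 0.3989422804014327;
    return inv_sqrt_2pi * std::exp(-0.5 * u * u);
}

// Holds the grid layout and a distribution over grid indices.
struct GridSampler {
    double lo;
    double step;
    std::discrete_distribution<int> pick;

    // Draw one grid point with probability proportional to its summed kernels.
    template <class Rng>
    double operator()(Rng& rng) { return lo + step * pick(rng); }
};

// Steps 1 and 2: build `count` evenly spaced points in [lo, hi] and give
// each one a weight equal to the sum of the kernels of all data points.
// Step 3: std::discrete_distribution rescales the weights to sum to 1.
GridSampler make_grid_sampler(const std::vector<double>& data, double h,
                              double lo, double hi, int count) {
    assert(!data.empty() && h > 0.0 && count > 1 && hi > lo);
    const double step = (hi - lo) / (count - 1);
    std::vector<double> weights(count, 0.0);
    for (int i = 0; i < count; ++i) {
        const double x = lo + step * i;
        for (double xi : data) weights[i] += grid_kernel((x - xi) / h);
    }
    return GridSampler{lo, step,
                       std::discrete_distribution<int>(weights.begin(), weights.end())};
}
```

For instance, with illustrative sample values such as -2.1, -1.3, -0.4, 1.9, 5.1, 6.2 and a bandwidth of 1.5, `make_grid_sampler(data, 1.5, -6.0, 11.0, 1000)` returns a sampler that can be called with a `std::mt19937` to draw grid points.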

This works and gives the correct result when I plot it, but I can't help but feel this is a very backwards way of doing things.

It would be nice not to have to calculate a probability for every point in the interval, but instead to get a formula from the KDE that I can feed random numbers into to get samples distributed according to the estimated density. Does anyone know how to do that?

By the way, I use C++ and would like to continue doing so.


Solution

  • If you are told that your sample follows some distribution and you know its parameters and PDF equation, you can plot the PDF directly from that equation. If the distribution is not given, you can't create such a formula, or at least you should avoid doing so. Example: if the sample is known to follow a Gaussian, you can estimate the mean and variance, plug them into the Gaussian equation, and plot the PDF.

    KDE is used to estimate the probability density function (PDF) from a finite data sample. It does not give you a compact parametric formula for the sample that can be used directly to plot the PDF. The whole idea behind KDE is to have a general estimator for the PDF (a smooth curve), not a closed-form equation.

    How KDE works is as you describe above and as shown in the Wikipedia example: 1. Assign each data point a weight (its relative frequency). 2. Place a Gaussian kernel on each point, with that point as the mean and a chosen bandwidth as the spread. 3. Sum all those kernels to get the final result.

    The main reason you can't have a single fixed equation is the free parameter in KDE, the bandwidth. The curve for the same sample keeps changing as you change the bandwidth, and the result can't be written in a simple form such as a polynomial.
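While there is no compact closed form, the sum of kernels is still an ordinary function of x once the bandwidth is fixed: f(x) = 1/(n*h) * sum over i of K((x - x_i) / h). With Gaussian kernels this is an equal-weight mixture of Gaussians centered on the data points, so one common way to draw samples without a grid is to pick a data point uniformly at random and add Gaussian noise with standard deviation h. The sketch below is a minimal illustration under those assumptions (Gaussian kernel, fixed bandwidth); `kde_density` and `kde_sample` are hypothetical names, not from the original answer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Gaussian KDE evaluated at x: f(x) = 1/(n*h) * sum_i K((x - x_i) / h),
// where K is the standard normal density and h is the bandwidth.
double kde_density(const std::vector<double>& data, double h, double x) {
    assert(!data.empty() && h > 0.0);
    const double inv_sqrt_2pi = 0.3989422804014327;
    double sum = 0.0;
    for (double xi : data) {
        const double u = (x - xi) / h;
        sum += inv_sqrt_2pi * std::exp(-0.5 * u * u);
    }
    return sum / (static_cast<double>(data.size()) * h);
}

// Draw one sample from the same Gaussian KDE without building a grid:
// choose a data point uniformly at random, then add N(0, h^2) noise.
template <class Rng>
double kde_sample(const std::vector<double>& data, double h, Rng& rng) {
    assert(!data.empty() && h > 0.0);
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    std::normal_distribution<double> noise(0.0, h);
    return data[pick(rng)] + noise(rng);
}
```

Calling `kde_sample(data, h, rng)` repeatedly with a `std::mt19937` should give samples whose histogram approaches the curve traced by `kde_density`. Changing `h` changes both the curve and the samples, which is the bandwidth dependence described above.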