kernel-density, probability-density

Density Estimation of a stream of Data


What statistical methods are out there that estimate the probability density of data as it arrives over time?

I need to estimate the pdf of a multivariate dataset; however, new data arrives over time, and the density estimate must be updated as it does.

What I have been using so far is kernel density estimation: I store a buffer of the data and compute a new kernel density estimate every time new data arrives. However, I can no longer keep up with the amount of data that needs to be stored, so I need a method that keeps track of the overall pdf/density estimate rather than the individual data points. Any suggestions would be really helpful. I work in Python, but since this is fairly open-ended, general algorithm suggestions would also be welcome.
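For reference, the buffer-and-recompute approach described above can be sketched as follows (the data, buffer handling, and sizes here are made up for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Sketch of the buffer-and-recompute approach: every new observation is
# appended to a buffer and the KDE is refit from scratch, so both memory
# and refit cost grow with the length of the stream.
rng = np.random.default_rng(0)
buffer = []

def update(point):
    """Store a new d-dimensional observation and refit the KDE."""
    buffer.append(point)
    if len(buffer) <= len(point):
        return None  # need more samples than dimensions for the covariance
    # gaussian_kde expects shape (d, n): variables in rows, samples in columns
    return gaussian_kde(np.array(buffer).T)

for _ in range(200):
    kde = update(rng.normal(size=2))  # 2-D data arriving one point at a time

density = kde(np.zeros((2, 1)))  # evaluate the current estimate at the origin
```

This is exactly what stops scaling: the whole stream lives in `buffer`, and each update refits over all of it.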


Solution

  • Scipy's implementation of KDE (`scipy.stats.gaussian_kde`) includes code that accumulates the density estimate one datum at a time instead of one evaluation point at a time. It is nested inside an "if there are more points than data" branch of `evaluate`, but you could probably re-purpose it for your needs.

    # excerpt from scipy.stats.gaussian_kde.evaluate
    # self.dataset: (d, n) data; points: (d, m) evaluation points
    if m >= self.n:
        # there are more evaluation points than data, so loop over the data
        for i in range(self.n):
            # Gaussian kernel contribution of datum i at every evaluation point
            diff = self.dataset[:, i, newaxis] - points
            tdiff = dot(self.inv_cov, diff)
            energy = sum(diff * tdiff, axis=0) / 2.0
            result = result + exp(-energy)
    

    In this case, you could store the running KDE as `result`; each time a new point arrives, compute its Gaussian contribution and add it to `result`. The raw data can then be dropped as needed, since you are only storing the KDE itself.
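One way to act on this suggestion is to evaluate the accumulated kernels on a fixed grid. The sketch below is an assumption-laden minimal version, not scipy's API: the grid and the bandwidth covariance must be chosen up front (scipy re-estimates the bandwidth from the stored data, which is no longer possible once the data is discarded).

```python
import numpy as np

class StreamingKDE:
    """Incremental Gaussian KDE evaluated on a fixed grid of points.

    Only the accumulated kernel sums are stored, never the raw data.
    The bandwidth covariance must be fixed up front, since the data
    needed to re-estimate it is discarded (unlike scipy's gaussian_kde).
    """

    def __init__(self, grid, cov):
        self.grid = np.atleast_2d(grid)   # (d, m) evaluation points
        cov = np.atleast_2d(cov)
        self.inv_cov = np.linalg.inv(cov)
        # normalization constant of a d-dimensional Gaussian kernel
        self.norm = np.sqrt(np.linalg.det(2 * np.pi * cov))
        self.result = np.zeros(self.grid.shape[1])
        self.n = 0                        # number of points absorbed so far

    def add_point(self, x):
        """Add one datum: accumulate its Gaussian kernel, then drop it."""
        diff = self.grid - np.asarray(x, dtype=float).reshape(-1, 1)
        tdiff = self.inv_cov @ diff
        energy = np.sum(diff * tdiff, axis=0) / 2.0
        self.result += np.exp(-energy)
        self.n += 1

    def density(self):
        """Current density estimate at the grid points."""
        return self.result / (self.n * self.norm)

# Usage: 1-D stream, grid on [-3, 3], hand-picked bandwidth 0.5 (cov 0.25)
rng = np.random.default_rng(1)
kde = StreamingKDE(grid=np.linspace(-3, 3, 61)[None, :], cov=[[0.25]])
for x in rng.normal(size=2000):
    kde.add_point([x])   # each point is folded in, then forgotten
dens = kde.density()
```

Memory is constant in the stream length (just the grid-sized `result`), and each update is O(grid size). The trade-off is that the estimate is only available at the chosen grid points and the bandwidth cannot adapt as data accumulates.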