Search code examples
matlabcluster-analysiskernel-density

Kernel Density Estimation for clustering 1 dimensional data


I am using Matlab and the code provided at http://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator/content/kde.m

to cluster 1D data. In particular I estimate the density function of my data and then analysing the peaks I should be able to identify the different distributions that form my dataset. (correct?) I then cluster the points according to these cluster centroids (peaks in density function).

You can find my data (z) at: https://drive.google.com/file/d/0B3vXKJ_zYaCJLUE3YkVBMmFtbUk/view?usp=sharing

and the plot of the probabiity density function at: https://drive.google.com/file/d/0B3vXKJ_zYaCJTjVobHRBOXo4Tmc/view?usp=sharing

What I did was simply to run

   [bandwidth,density,xmesh]=kde(z);

   plot(xmesh,density);

What I get (please have a look at the second link) is 1 peak in the density function per data point.... I think that I am doing something wrong... Could the default parameter of the kde function be the cause?

kde(data,n,MIN,MAX)
%     data    - a vector of data from which the density estimate is constructed;
%          n  - the number of mesh points used in the uniform discretization of the
%               interval [MIN, MAX]; n has to be a power of two; if n is not a power of two, then
%               n is rounded up to the next power of two, i.e., n is set to n=2^ceil(log2(n));
%               the default value of n is n=2^12;
%   MIN, MAX  - defines the interval [MIN,MAX] on which the density estimate is constructed;
%               the default values of MIN and MAX are:
%               MIN=min(data)-Range/10 and MAX=max(data)+Range/10, where Range=max(data)-min(data);

Would this be possible? Could you tell me on which basis should I change them?


Solution

  • You point out the solution in your question. The documentation suggests that the algorithm sets a ceiling of 2^N peaks created from the data. The default (16k or 2^14) is larger than the number of points you supplied (~8k) resulting in the "spiky" behaviour.

    If you instead run

     [bandwidth,density,xmesh]=kde(z,2^N);
    

    for different values of 2^N (the function demands a power of 2, must be a FFT thing) you get a plot like follows:

    enter image description here

    based on which you can pick the value of N that is appropriate.