Search code examples
c++cluster-analysisk-meansunsupervised-learningstatistical-sampling

recognize the levels of 1D data by only knowing the number of levels


I have a sensor that output data consist of one attribute (mono value). An example of punch of sequenced data is as follows:

sample: 199 200 205 209 217 224 239 498 573 583 583 590 591 594 703 710 711 717 719 721 836 840 845 849 855 855 856 857 858 858 928 935 936 936 942 943 964 977

You can see the data from the first image input.

input

The data is divided into levels. The number of levels is given for me (5 levels in this example). However, the number of samples for each level is unknown, as well as the distances between the levels are also unknown.

I need to exclude the outliers and define the center of each level (look at the second image output.

output

The red samples represent the outliers and the yellow represent the centers of levels). Is there any algorithm, mathematical formula, c++ code may help me to achieve this requirement?

I tried KMeans (with K = 5 in this example) and I got bad result because of the random initial K centroids. Most time some inintial centroids share the same level that let this level become two clusters, whereas other two levels belongs to one cluster. If I set the initial centroids manually by selecting one centroid from each level I get very good results.


Solution

  • This is an extension of the answer of @KarthikeyanMV. +1. Yes, you need to be able to determine a value for Delta. Here is a process that will do that. I am writing my code in R, but I think that the process will be clear.

    Presumably, the gaps between groups are bigger than the gaps within any group, so just look at the difference between successive points and ask where the big gaps are. Since you believe that there should be 5 groups, there should be 4 big gaps, so look at the 4th biggest difference.

    ## Your data
    dat = c(199, 200, 205, 209, 217, 224, 239, 498, 573, 583, 
        583, 590, 591, 594, 703, 710, 711, 717, 719, 721, 
        836, 840, 845, 849, 855, 855, 856, 857, 858, 858, 
        928, 935, 936, 936, 942, 943, 964, 977)
    (Delta = sort(diff(dat), decreasing=TRUE)[4])
    [1] 75
    

    This looks like Delta should be 75, but we failed to account for the outliers. Are there any points that are more than Delta from both the next point above and below? Yes.

    BigGaps = diff(dat) >= Delta
    (Outliers = which(c(BigGaps, T) & c(T, BigGaps)))
    [1] 8
    

    Point 8 is too far away to belong to either the group above or below. So let's remove it and try again.

    dat = dat[-Outliers]
    (Delta = sort(diff(dat), decreasing=TRUE)[4])
    [1] 70
    BigGaps = diff(dat) >= Delta
    (Outliers = which(c(BigGaps, T) & c(T, BigGaps)))
    integer(0)
    

    After we remove point 8 the new Delta is 70. We check for outliers using the new Delta (70) and find none. So let's cluster using Delta = 70.

    Cluster = cumsum(c(1, diff(dat)>=Delta))
    plot(dat, pch=20, col=Cluster+1)
    

    Clustered data

    This mostly found the clusters that you want except that it included the last two points in the highest cluster rather than declaring them to be outliers. I do not see why they should be outliers instead of part of this group. Maybe you could elaborate on why you think that they should not be included.

    I hope that this helps.