Imagine I have an array like the following:
[0.1,0.12,0.14,0.45,0.88,0.91,0.94,14.3,15,16]
I'd like to identify patterns in this, so I can compare it to another dataset to see if it matches. For instance, if I input 0.89, I'd like to be able to see that this belongs to the 0.88-0.94 cluster. However, if I enter 0.5, I'd like to see that this does not belong in the dataset, even though it is close to 0.45 - it is an anomaly in the data.
(The above array contains sample numbers; in the actual system I'm comparing properties of HTML code in order to categorise them. I'm using TensorFlow for text categorisation, but some features (such as CSS length and the CSS:HTML ratio) are numbers. There are patterns in these, but they aren't obvious or all in one place - e.g. category A might have a lot of very high and very low values, but almost none in between. I can't give you the real numbers, because they are determined by the input code and the ML preprocessor, but we can assume roughly 10% of them are anomalies, and the rest almost always fall in some combination of the lower, middle, or upper range. When 'training', these numbers are taken from the data and stored in one of three arrays (one per category). I then want to take my input number and tell which array's pattern it lines up with.)
Now, imagine the array is hundreds or thousands of items long. At least 10% of the items will be anomalies, and I need to account for that. I suppose "cluster detection" isn't quite the right term - it's mainly about getting rid of anomalies - but the part I particularly got stuck on was handling ranges of different sizes. For instance, in the example above I'd still like 14.3-16 to count as one range/cluster, even though its items are much further apart than those in 0.1-0.14.
I've done some digging through the Wikipedia article on anomaly detection (https://en.m.wikipedia.org/wiki/Anomaly_detection), and found that the most likely functional and simple approach would be k-nearest-neighbour-style density analysis. However, I haven't been able to find a Python package that does this for me out of the box - there are so many variations on this specific task that it's basically impossible to find exactly what I'm looking for. I've also tried writing my own basic algorithm: compare each item to its neighbours and assign it to the cluster it is closer to, or, if the distance is greater than 2× the mean of the distances between the other items in the cluster, class it as an anomaly. However, this wasn't very accurate, it still had an element of human bias (why 2×, not 3×?), and it went completely haywire at the start and end of the array. So if any of you can recommend a quick algorithm that would work better, or an implementation of the above, that would be greatly appreciated.
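To make the discussion concrete, here is a minimal sketch of the gap-based heuristic described above (this is my reconstruction, not the exact code: it uses each point's nearest-neighbour distance and the global mean gap, and the 2× multiplier is the arbitrary constant in question):

```python
import numpy as np

def gap_anomalies(data, multiplier=2.0):
    """Flag points whose nearest-neighbour gap exceeds `multiplier` times
    the mean gap between consecutive sorted values (ad-hoc heuristic)."""
    data = np.sort(np.asarray(data, dtype=float))
    gaps = np.diff(data)                       # distance to the next point
    # Distance from each point to its nearest neighbour (both ends included)
    nearest = np.empty(len(data))
    nearest[0] = gaps[0]
    nearest[-1] = gaps[-1]
    nearest[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    return data[nearest > multiplier * gaps.mean()]

anomalies = gap_anomalies([0.1, 0.12, 0.14, 0.45, 0.88, 0.91, 0.94, 14.3, 15, 16])
```

On the sample array this flags nothing at all: the single huge gap (13.36, between 0.94 and 14.3) inflates the mean gap to about 1.77, so the 2× threshold is never reached by any point's nearest-neighbour distance - one concrete way the heuristic misbehaves.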
Thanks in advance.
Use classical statistical techniques such as kernel density estimation (KDE). There are well-known heuristics for choosing the bandwidth, such as Silverman's rule of thumb. KDE is easy to apply and the preferred choice on 1-dimensional data.
Then define a density threshold: remove the points whose estimated density falls below it, and split the remaining data into clusters wherever a gap appears.
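For the sample array in the question, a minimal self-contained sketch might look like the following. The log transform, the bandwidth, and the threshold fraction are all assumptions tuned to this toy data: working on a log scale is one way to cope with clusters of very different widths (0.1-0.14 vs. 14.3-16), and assumes all values are positive.

```python
import numpy as np

def kde(points, data, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at `points`."""
    diffs = np.asarray(points, float)[:, None] - np.asarray(data, float)[None, :]
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    return kernels.sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

raw = np.array([0.1, 0.12, 0.14, 0.45, 0.88, 0.91, 0.94, 14.3, 15, 16])

# On a log scale the cluster widths become comparable, so a single
# bandwidth works for both the 0.1-0.14 and the 14.3-16 groups.
# (Assumes the values are positive.)
data = np.log(raw)

bw = 0.2                     # tuning knob; see Silverman's rule for a heuristic
density = kde(data, data, bw)

# Density threshold: points well below the typical density are anomalies.
threshold = 0.5 * np.median(density)
kept = np.sort(data[density >= threshold])

# Split the surviving points into clusters at gaps larger than a few bandwidths.
splits = np.where(np.diff(kept) > 3 * bw)[0] + 1
clusters = [np.exp(c) for c in np.split(kept, splits)]
# clusters -> [0.1, 0.12, 0.14], [0.88, 0.91, 0.94], [14.3, 15, 16]; 0.45 dropped
```

Classifying a new input is then a density lookup: evaluate `kde(np.log([x]), data, bw)` and compare it to the same threshold, or check whether `x` falls inside (or near) one of the surviving cluster ranges.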