While I appreciate this question is math-heavy, a real answer to it will help anyone who deals with MongoDB's $bucket operator (or its SQL analogues) and builds cluster/heatmap chart data.
I have an array of unique/distinct price values from my DB (it's always an array of numbers, with 0.01 precision).
As you can see, most of its values are between ~8 and 40 (in this particular case).
[
7.9, 7.98, 7.99, 8.05, 8.15, 8.25, 8.3, 8.34, 8.35, 8.39,
8.4, 8.49, 8.5, 8.66, 8.9, 8.97, 8.98, 8.99, 9, 9.1,
9.15, 9.2, 9.28, 9.3, 9.31, 9.32, 9.4, 9.46, 9.49, 9.5,
9.51, 9.69, 9.7, 9.9, 9.98, 9.99, 10, 10.2, 10.21, 10.22,
10.23, 10.24, 10.25, 10.27, 10.29, 10.49, 10.51, 10.52, 10.53, 10.54,
10.55, 10.77, 10.78, 10.98, 10.99, 11, 11.26, 11.27, 11.47, 11.48,
11.49, 11.79, 11.85, 11.9, 11.99, 12, 12.49, 12.77, 12.8, 12.86,
12.87, 12.88, 12.89, 12.9, 12.98, 13, 13.01, 13.49, 13.77, 13.91,
13.98, 13.99, 14, 14.06, 14.16, 14.18, 14.19, 14.2, 14.5, 14.53,
14.54, 14.55, 14.81, 14.88, 14.9, 14.98, 14.99, 15, 15.28, 15.78,
15.79, 15.8, 15.81, 15.83, 15.84, 15.9, 15.92, 15.93, 15.96, 16,
16.5, 17, 17.57, 17.58, 17.59, 17.6, 17.88, 17.89, 17.9, 17.93,
17.94, 17.97, 17.99, 18, 18.76, 18.77, 18.78, 18.99, 19.29, 19.38,
19.78, 19.9, 19.98, 19.99, 20, 20.15, 20.31, 20.35, 20.38, 20.39,
20.44, 20.45, 20.49, 20.5, 20.69, 20.7, 20.77, 20.78, 20.79, 20.8,
20.9, 20.91, 20.92, 20.93, 20.94, 20.95, 20.96, 20.99, 21, 21.01,
21.75, 21.98, 21.99, 22, 22.45, 22.79, 22.96, 22.97, 22.98, 22.99,
23, 23.49, 23.78, 23.79, 23.8, 23.81, 23.9, 23.94, 23.95, 23.96,
23.97, 23.98, 23.99, 24, 24.49, 24.5, 24.63, 24.79, 24.8, 24.89,
24.9, 24.96, 24.97, 24.98, 24.99, 25, 25.51, 25.55, 25.88, 25.89,
25.9, 25.96, 25.97, 25.99, 26, 26.99, 27, 27.55, 28, 28.8,
28.89, 28.9, 28.99, 29, 29.09, 30, 31.91, 31.92, 31.93, 33.4,
33.5, 33.6, 34.6, 34.7, 34.79, 34.8, 35, 38.99, 39.57, 39.99,
40, 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000
]
I need to find the irrelevant values in this array, some kind of «dirty tail», and remove them. Actually, I don't even need to remove them from the array; what I really need is to find the last relevant number and use it as the cap value, so that I get a range between floor (min relevant) and cap (max relevant), like:
floor value => 8
cap value => 40
What am I talking about?
For example, for the array above, it would be all values after 40 (or maybe even 60), like 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000.
I consider all of them abnormal.
S tier. Clear/optimal code (language doesn't matter, but JavaScript is preferred) or a formula (if math has one) that solves the problem quickly and without heavy resource use. It would be perfect if I didn't even need to check every element in the array, or could skip some of them, e.g. by starting from the peak / most popular value in the array.
A tier. Your own experience or a code attempt with any relevant results, or an improvement of the current formula with better performance.
B tier. Something useful: a blog article or a Google link. The main requirement is that it makes sense. Non-obvious solutions are welcome, even if your code is terribly formatted and so on.
By which criteria, and how, should I «target the tail» / remove the non-relevant elements with x (dramatically rising and rarely occurring) values from the array?
The given data set has some huge outliers, which make it somewhat hard to analyze using standard statistical methods (if it were better behaved, I would recommend fitting several candidate distributions to it and finding out which fits best - log normal distribution, beta distribution, gamma distribution, etc).
The problem of determining which outliers to ignore can be solved in general through more simplistic but less rigorous methods; one method is to compare the values of the data at various percentiles and throw away the ones where the differences become "too high" (for a suitably chosen value of "too high").
For example, here are the last few entries if we go up by two percentile slots; the delta column gives the difference between the previous percentile and this one.
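A table of this shape can be regenerated with a small sketch like the one below. The function name percentileDeltas is hypothetical, and the percentile index convention (Math.floor(length * p / 100), clamped to the last element) is an assumption chosen to match the cut-off expression used later in this answer; other percentile definitions will shift the rows slightly.

```javascript
// Hypothetical helper: value at each percentile step plus the gap to the previous row.
// Assumes `sorted` is a non-empty array of numbers in ascending order.
function percentileDeltas(sorted, step = 2) {
  const rows = [];
  let previous = null;
  for (let p = step; p <= 100; p += step) {
    // Assumed index convention: floor(length * p / 100), clamped to the last element.
    const index = Math.min(sorted.length - 1, Math.floor(sorted.length * p / 100));
    const value = sorted[index];
    // Round to 0.01 because the prices have two-decimal precision.
    const delta = previous === null ? 0 : Math.round((value - previous) * 100) / 100;
    rows.push({ percentile: p, value, delta });
    previous = value;
  }
  return rows;
}
```

Calling console.table(percentileDeltas(data).slice(-10)) on the sorted array prints the last few rows, which is usually enough to spot where the deltas start to grow.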
Here, you can see that the difference with the previous entry increases by almost 2 once we hit 87, and goes up (mostly) from there. To use a "nice" number, let's make the cut-off the 85th percentile, and ignore all values above that.
Given the sorted list above in an array named data, we ignore any index above Math.floor(data.length*85/100).
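As a minimal sketch of that cut-off, the function below returns the floor and cap values the question asks for. The name rangeAtPercentile and the returned field names are hypothetical, and the code assumes the array is already sorted in ascending order.

```javascript
// Hypothetical helper: floor/cap range using a fixed percentile cut-off.
// Assumes `data` is a non-empty array of numbers in ascending order.
function rangeAtPercentile(data, percentile = 85) {
  // Clamp so that percentile = 100 still points at the last element.
  const cutoff = Math.min(data.length - 1, Math.floor(data.length * percentile / 100));
  return {
    floor: data[0],                  // smallest value
    cap: data[cutoff],               // last value treated as relevant
    kept: data.slice(0, cutoff + 1), // indices 0..cutoff inclusive
  };
}
```

Since the input is sorted, this is a single index lookup, so no pass over the whole array is needed once sorting is done.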
The analysis above can be repeated in code if it should change dynamically (or to call attention to deviations where 85 is not the right value), but I leave this as an exercise for the reader.
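For readers who want a starting point, one possible way to automate the analysis is sketched below: walk the percentiles upward and stop before the first jump that exceeds a threshold. The name findCap and the defaults for startAt, step and maxDelta are assumptions for illustration, not values derived from the data; maxDelta is the "too high" knob and has to be tuned per data set.

```javascript
// Hypothetical helper: walk percentiles upward and stop before the first jump
// larger than maxDelta. Assumes `sorted` is a non-empty ascending array.
// startAt, step and maxDelta are tunable guesses, not derived values.
function findCap(sorted, { startAt = 50, step = 2, maxDelta = 1 } = {}) {
  // Same assumed index convention as the fixed cut-off: floor(length * p / 100), clamped.
  const valueAt = (p) =>
    sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * p / 100))];

  let capPercentile = startAt;
  let cap = valueAt(startAt);
  for (let p = startAt + step; p <= 100; p += step) {
    const value = valueAt(p);
    if (value - cap > maxDelta) break; // the tail is taken to start at this jump
    capPercentile = p;
    cap = value;
  }
  return { capPercentile, cap };
}
```

Starting from the median rather than from index 0 skips the dense lower half of the values; if the result differs a lot from a fixed 85th-percentile cut, that is a hint the fixed value no longer fits the data.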