I have an array of numbers like this in PHP:
$numbers = [
0.0021030494216614,
0.0019940179461615,
0.0079320972662613,
0.0040485829959514,
0.0079320972662613,
0.0021030494216614,
0.0019940179461615,
0.0079320972662613,
0.0040485829959514,
0.0079320972662613,
0.0021030494216614,
1.1002979145978,
85.230769230769,
6.5833333333333,
0.015673981191223
];
In PHP, I am trying to find the outliers / anomalies in this array.
As you can see, the anomalies are:
1.1002979145978,
85.230769230769,
6.5833333333333,
0.015673981191223
I am trying to find and remove the anomalies in any array.
Here is my code:
function remove_anomalies($dataset, $magnitude = 1) {
    $count = count($dataset);
    $mean = array_sum($dataset) / $count;
    $deviation = sqrt(array_sum(array_map("sd_square", $dataset, array_fill(0, $count, $mean))) / $count) * $magnitude;
    return array_filter($dataset, function($x) use ($mean, $deviation) { return ($x <= $mean + $deviation && $x >= $mean - $deviation); });
}

function sd_square($x, $mean) {
    return pow($x - $mean, 2);
}
However, when I pass in my $numbers array, it only identifies [85.230769230769] as an outlier, even though there are clearly more outliers there.
I have tried adjusting $magnitude, but that did not improve anything.
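One likely reason only 85.230769230769 is flagged: a single extreme value pulls the mean up and inflates the standard deviation, so the band mean ± deviation becomes wide enough to cover every other value. The sketch below recomputes both figures for the array above, using the population standard deviation as remove_anomalies does. The variable names are illustrative, and the arrow functions assume PHP 7.4 or later.

```php
<?php
$numbers = [
    0.0021030494216614, 0.0019940179461615, 0.0079320972662613,
    0.0040485829959514, 0.0079320972662613, 0.0021030494216614,
    0.0019940179461615, 0.0079320972662613, 0.0040485829959514,
    0.0079320972662613, 0.0021030494216614, 1.1002979145978,
    85.230769230769, 6.5833333333333, 0.015673981191223,
];

$count = count($numbers);
$mean  = array_sum($numbers) / $count;

// Population standard deviation, the same quantity remove_anomalies() computes.
$squares = array_map(fn($x) => ($x - $mean) ** 2, $numbers);
$sd      = sqrt(array_sum($squares) / $count);

printf("mean = %.2f, sd = %.2f\n", $mean, $sd);
printf("accepted range: %.2f to %.2f\n", $mean - $sd, $mean + $sd);

// Values outside mean ± 1 sd, i.e. what the filter treats as anomalies.
$flagged = array_filter($numbers, fn($x) => abs($x - $mean) > $sd);
print_r(array_values($flagged));
```

For this data the mean is roughly 6.2 and the deviation roughly 21, so the accepted range runs from about -15 to about 27, and only 85.230769230769 falls outside it.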
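The algorithm shown here uses the mean absolute deviation around the median (MAD), a measure of spread that is less sensitive to extreme values than the standard deviation, to identify outliers. All elements whose distance from the median exceeds a multiple of the MAD are removed, then the median and the MAD are recalculated. This repeats until a pass removes nothing.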
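// Median of the values; returns false for an empty array.
function median(array $data)
{
    if(($count = count($data)) < 1) return false;
    sort($data, SORT_NUMERIC); // works on a copy and reindexes from 0
    $mid = (int)($count/2);
    if($count % 2) return $data[$mid];
    return ($data[$mid] + $data[$mid-1])/2;
}

// Mean absolute deviation around the median; returns false for an empty array.
function mad(array $data)
{
    if(($count = count($data)) < 1) return false;
    $median = median($data);
    $mad = 0.0;
    foreach($data as $xi) {
        $mad += abs($xi - $median);
    }
    return $mad/$count;
}

// Removes outliers from $data in place (passed by reference).
// Original keys are preserved, so the result may have gaps in its keys.
function cleanMedian(array &$data, $fac = 2.0)
{
    do{
        $unsetCount = 0;
        $median = median($data);
        $mad = mad($data) * $fac; // threshold for this pass
        //remove all with diff > $mad
        foreach($data as $idx => $val){
            if(abs($val - $median) > $mad){
                unset($data[$idx]);
                ++$unsetCount;
            }
        }
    } while($unsetCount > 0); // repeat until a pass removes nothing
}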
How to use:
$data = [
//..
];
cleanMedian($data);
The parameter $fac needs to be tuned experimentally, depending on the data. With $fac = 2 you get the desired result:
array (
0 => 0.0021030494216614,
1 => 0.0019940179461615,
2 => 0.0079320972662613,
3 => 0.0040485829959514,
4 => 0.0079320972662613,
5 => 0.0021030494216614,
6 => 0.0019940179461615,
7 => 0.0079320972662613,
8 => 0.0040485829959514,
9 => 0.0079320972662613,
10 => 0.0021030494216614,
)
With $fac = 4, the value 0.015673981191223 is also kept.
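Because cleanMedian modifies the array it is given, comparing several values of $fac means copying the data each time. A minimal sketch of one way to do that is shown below. The helper filterByDeviation is hypothetical (it is not part of the code above): it applies the same iterative rule, but to a copy, and returns the filtered array. It assumes PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: same rule as cleanMedian(), but non-mutating.
function filterByDeviation(array $data, float $fac = 2.0): array
{
    do {
        $count = count($data);
        if ($count === 0) {
            break;
        }

        // Median of the current values.
        $sorted = array_values($data);
        sort($sorted, SORT_NUMERIC);
        $mid    = intdiv($count, 2);
        $median = ($count % 2)
            ? $sorted[$mid]
            : ($sorted[$mid] + $sorted[$mid - 1]) / 2;

        // Mean absolute deviation around the median, scaled by $fac.
        $sum = 0.0;
        foreach ($data as $x) {
            $sum += abs($x - $median);
        }
        $limit = $fac * $sum / $count;

        // Keep values within the limit; array_filter preserves keys.
        $kept    = array_filter($data, fn($x) => abs($x - $median) <= $limit);
        $removed = $count - count($kept);
        $data    = $kept;
    } while ($removed > 0);

    return $data;
}

$numbers = [
    0.0021030494216614, 0.0019940179461615, 0.0079320972662613,
    0.0040485829959514, 0.0079320972662613, 0.0021030494216614,
    0.0019940179461615, 0.0079320972662613, 0.0040485829959514,
    0.0079320972662613, 0.0021030494216614, 1.1002979145978,
    85.230769230769, 6.5833333333333, 0.015673981191223,
];

foreach ([2.0, 4.0] as $fac) {
    $kept = filterByDeviation($numbers, $fac);
    printf("fac = %.1f: %d values kept\n", $fac, count($kept));
}
// fac = 2.0: 11 values kept
// fac = 4.0: 12 values kept
```

Since $numbers is passed by value here, the original array still holds all 15 entries after both runs.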