
Finding and removing outliers / anomalies in an array of numbers in PHP


I have an array of numbers like this in PHP:

$numbers = [
    0.0021030494216614,
    0.0019940179461615,
    0.0079320972662613,
    0.0040485829959514,
    0.0079320972662613,
    0.0021030494216614,
    0.0019940179461615,
    0.0079320972662613,
    0.0040485829959514,
    0.0079320972662613,
    0.0021030494216614,
    1.1002979145978,
    85.230769230769,
    6.5833333333333,
    0.015673981191223
];

In PHP, I am trying to find the outliers / anomalies in this array.

As you can see, the anomalies are

1.1002979145978,
85.230769230769,
6.5833333333333,
0.015673981191223

I am trying to find and remove the anomalies in any array.

Here is my code

function remove_anomalies($dataset, $magnitude = 1) {
    $count = count($dataset);
    $mean = array_sum($dataset) / $count;
    $deviation = sqrt(array_sum(array_map("sd_square", $dataset, array_fill(0, $count, $mean))) / $count) * $magnitude;
        
    return array_filter($dataset, function($x) use ($mean, $deviation) { return ($x <= $mean + $deviation && $x >= $mean - $deviation); });
}
    
function sd_square($x, $mean) {
    return pow($x - $mean, 2);
}

However, when I pass my $numbers array in, it removes only 85.230769230769 as an outlier, even though there are clearly more outliers in the data. I have tried adjusting $magnitude, and that did not improve anything.


Solution
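
For context, the mean and standard deviation used in the question are themselves pulled upward by the extreme values, which is why only the largest value falls outside one standard deviation. A minimal sketch using the data from the question; the variable names $sumSquares, $sd and $flagged are illustrative and not part of the original code:

```php
<?php
// Data from the question.
$numbers = [
    0.0021030494216614, 0.0019940179461615, 0.0079320972662613,
    0.0040485829959514, 0.0079320972662613, 0.0021030494216614,
    0.0019940179461615, 0.0079320972662613, 0.0040485829959514,
    0.0079320972662613, 0.0021030494216614, 1.1002979145978,
    85.230769230769, 6.5833333333333, 0.015673981191223,
];

$count = count($numbers);
$mean = array_sum($numbers) / $count;

// Population standard deviation, computed the same way as in remove_anomalies().
$sumSquares = 0.0;
foreach ($numbers as $x) {
    $sumSquares += ($x - $mean) ** 2;
}
$sd = sqrt($sumSquares / $count);

// Values lying more than one standard deviation away from the mean.
$flagged = array_values(array_filter(
    $numbers,
    function ($x) use ($mean, $sd) {
        return abs($x - $mean) > $sd;
    }
));

printf("mean = %.4f, sd = %.4f\n", $mean, $sd);
print_r($flagged);
```

The mean comes out at roughly 6.2 and the standard deviation at roughly 21, so 6.58 and 1.10 sit comfortably inside the band and only 85.23 is flagged.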

  • The algorithm shown here uses the mean absolute deviation (MAD) around the median, a robust measure, to identify outliers. All elements whose distance from the median exceeds a multiple of the MAD are removed, then the median and the MAD are recalculated. This repeats until no more elements are removed.

      function median(array $data)
      {
        if(($count = count($data)) < 1) return false;
        sort($data, SORT_NUMERIC);
        $mid = (int)($count/2);
        if($count % 2) return $data[$mid];
        return  ($data[$mid] + $data[$mid-1])/2;
      }
      
      function mad(array $data)
      {
        if(($count = count($data)) < 1) return false;
        $median = median($data);
        $mad = 0.0;
        foreach($data as $xi) {
          $mad += abs($xi - $median);
        }
        return $mad/$count;
      }
    
      function cleanMedian(array &$data, $fac = 2.0)
      {
        do{
          $unsetCount = 0;
          $median = median($data);
          $mad = mad($data) * $fac;
          //remove all with diff > $mad
          foreach($data as $idx => $val){
            if(abs($val - $median) > $mad){
              unset($data[$idx]);
              ++$unsetCount;
            }
          }
        } while($unsetCount > 0);
      }
    

    How to use:

    $data = [
     //..
    ];
    cleanMedian($data);
    

    The parameter $fac needs to be tuned experimentally for the data at hand. With $fac = 2 you get the desired result:

    array (
      0 => 0.0021030494216614,
      1 => 0.0019940179461615,
      2 => 0.0079320972662613,
      3 => 0.0040485829959514,
      4 => 0.0079320972662613,
      5 => 0.0021030494216614,
      6 => 0.0019940179461615,
      7 => 0.0079320972662613,
      8 => 0.0040485829959514,
      9 => 0.0079320972662613,
      10 => 0.0021030494216614,
    )
    

    With $fac = 4, the value 0.015673981191223 is kept as well.
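
To compare the effect of different $fac values, one possible approach is to run cleanMedian() on a copy of the data for each candidate and count what survives. The sketch below restates the three functions from above in compact form so that it runs on its own; the $kept array and the loop around it are illustrative additions, not part of the original answer:

```php
<?php
// Compact restatement of median() from the answer.
function median(array $data)
{
    if (($count = count($data)) < 1) return false;
    sort($data, SORT_NUMERIC);
    $mid = (int)($count / 2);
    if ($count % 2) return $data[$mid];
    return ($data[$mid] + $data[$mid - 1]) / 2;
}

// Mean absolute deviation around the median, as in the answer.
function mad(array $data)
{
    if (($count = count($data)) < 1) return false;
    $median = median($data);
    $sum = 0.0;
    foreach ($data as $xi) {
        $sum += abs($xi - $median);
    }
    return $sum / $count;
}

// Repeatedly drops values farther than $fac * MAD from the median.
function cleanMedian(array &$data, $fac = 2.0)
{
    do {
        $unsetCount = 0;
        $median = median($data);
        $limit = mad($data) * $fac;
        foreach ($data as $idx => $val) {
            if (abs($val - $median) > $limit) {
                unset($data[$idx]);
                ++$unsetCount;
            }
        }
    } while ($unsetCount > 0);
}

// Data from the question.
$numbers = [
    0.0021030494216614, 0.0019940179461615, 0.0079320972662613,
    0.0040485829959514, 0.0079320972662613, 0.0021030494216614,
    0.0019940179461615, 0.0079320972662613, 0.0040485829959514,
    0.0079320972662613, 0.0021030494216614, 1.1002979145978,
    85.230769230769, 6.5833333333333, 0.015673981191223,
];

// Illustrative comparison: number of values kept for each candidate factor.
$kept = [];
foreach ([2, 4] as $fac) {
    $copy = $numbers;          // cleanMedian() modifies its argument in place
    cleanMedian($copy, $fac);
    $kept[$fac] = count($copy);
}
print_r($kept);
```

Working on a copy matters because cleanMedian() takes the array by reference. Note also that unset() leaves the original keys in place, so array_values() can be applied afterwards if a re-indexed list is needed.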