Tags: python, numpy, na, outliers

Replace outliers with NA in NumPy array


Given a NumPy array like this:

[[100, 110, 0.01, 110], [120, 100, 112, 100], [4000, 100, 200, 100]]

How can I replace the outliers with NA?

[[100, 110, NA, 110], [120, 100, 112, 100], [NA, 100, 200, 100]]

As for outlier detection, I'm happy with treating anything more than 2 standard deviations from the mean as an outlier.


Solution

  • I am assuming you have SD and mean functions coded or imported somewhere.

    Then you can write something like this:

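    import numpy as np

    sd = my_sd_function(my_array)
    mean = my_mean_function(my_array)
    # Boolean mask: True where a value lies more than 2 SDs from the mean
    outliers = (my_array > (mean + 2 * sd)) | (my_array < (mean - 2 * sd))
    # NumPy has no NA; np.nan is the usual stand-in and needs a float array
    my_array[outliers] = np.nan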
    
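As a rough end-to-end sketch of the snippet above, here it is applied to the array from the question. This version assumes that NA can be represented by np.nan and swaps in NumPy's own mean() and std() methods for the helper functions; both choices are mine, not something the question specifies.

```python
import numpy as np

# The array from the question. float dtype is needed because NaN is a float.
my_array = np.array(
    [[100, 110, 0.01, 110],
     [120, 100, 112, 100],
     [4000, 100, 200, 100]],
    dtype=float,
)

mean = my_array.mean()
sd = my_array.std()  # population SD (ddof=0), computed over the whole array

# True where a value lies more than 2 SDs from the overall mean
outliers = (my_array > mean + 2 * sd) | (my_array < mean - 2 * sd)

cleaned = my_array.copy()   # leave the original untouched
cleaned[outliers] = np.nan  # assumption: NaN stands in for NA

print(cleaned)
```

Note that on this particular data only 4000 gets flagged. The single large value inflates the standard deviation so much that 0.01 still falls inside the 2 SD band, so the rule as stated may not give exactly the output shown in the question.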

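    But consider:

    • I'm not sure what you mean by NA, since NumPy has no such value. Perhaps None, or more usually np.nan, which can only be stored in a float array.
    • I don't know how your array is structured, so I can't tailor the functions to it. Perhaps these would satisfy your needs (they compute the mean and the population SD over all elements):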

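      import math

      def my_mean_function(arr):
          # Mean over all elements; equivalent to arr.mean()
          return arr.sum() / arr.size

      def my_sd_function(arr):
          # Population SD over all elements; equivalent to arr.std()
          mean = my_mean_function(arr)
          sqrerr = ((arr - mean) ** 2).sum() / arr.size
          return math.sqrt(sqrerr)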
      
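If you would rather not maintain these helpers, a quick check like the one below suggests they can be swapped for NumPy's built-in mean() and std() methods. The sample array here is made up purely for illustration.

```python
import math
import numpy as np

def my_mean_function(arr):
    return arr.sum() / arr.size

def my_sd_function(arr):
    mean = my_mean_function(arr)
    sqrerr = ((arr - mean) ** 2).sum() / arr.size
    return math.sqrt(sqrerr)

# Hypothetical sample data, only used to compare the two approaches
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Both comparisons should hold up to floating-point rounding
same_mean = np.isclose(my_mean_function(arr), arr.mean())
same_sd = np.isclose(my_sd_function(arr), arr.std())  # std() defaults to ddof=0

print(same_mean, same_sd)
```

The helpers divide by arr.size, so they match std() with its default ddof=0 (population SD). If you want the sample SD instead, arr.std(ddof=1) is the built-in route.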

    The core idea here is selecting and updating array elements based on a condition.

    Here you use &, ~ and | instead of the keywords and, not and or. NumPy arrays overload those operators to work element by element, whereas and, or and not are language keywords that cannot be overloaded. Because &, | and ~ bind more tightly than comparisons, each comparison needs its own parentheses.

    Such expressions return boolean arrays (you can print outliers in a console / IPython to see what I am talking about).
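
To make the operator point concrete, here is a small sketch with a made-up array (the values and variable names are only for illustration). It combines two masks element-wise and shows that the keyword form fails.

```python
import numpy as np

a = np.array([1, 5, 10, 15])  # hypothetical data

low = a < 3
high = a > 12

either = low | high   # element-wise OR
both = low & high     # element-wise AND
neither = ~either     # element-wise NOT

print(either.tolist())   # [True, False, False, True]
print(neither.tolist())  # [False, True, True, False]

# The keyword form needs a single truth value for a whole array,
# which NumPy refuses to guess for arrays with more than one element.
try:
    result = low or high
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```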

    The second part is that you can pass to my_array[...], as an index, a list of indexes, a boolean array, or constructs such as slices, and retrieve / alter the selected elements efficiently. Assigning through such an index, as in my_array[outliers] = ..., modifies the array in place. When reading, a slice gives a view onto the underlying data, while indexing with a boolean array or a list of indexes gives a copy.
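
The difference between reading and assigning through an index can be checked with a short sketch like this one. The array is made up, and np.shares_memory is used only to show which results share storage with the original.

```python
import numpy as np

a = np.arange(6, dtype=float)  # hypothetical data: 0.0 .. 5.0
mask = a > 3

sliced = a[1:4]   # basic slicing returns a view
picked = a[mask]  # boolean indexing returns a copy

print(np.shares_memory(a, sliced))  # True
print(np.shares_memory(a, picked))  # False

picked[:] = -1    # changes only the copy, a is unaffected
a[mask] = np.nan  # assignment through the mask updates a in place

print(a)
```

So for the outlier replacement, write my_array[outliers] = value directly. Pulling the elements out first and modifying the result would leave my_array unchanged.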