Given a numpy array like this
[[100, 110, 0.01, 110], [120, 100, 112, 100], [4000, 100, 200, 100]]
How can I replace the outliers with NA?
[[100, 110, NA, 110], [120, 100, 112, 100], [NA, 100, 200, 100]]
As for outlier detection, I'm happy with 2 SDs from the mean.
I am assuming you have SD and mean functions coded or imported somewhere.
If so, you could write something like this:
import numpy as np

sd = my_sd_function(my_array)
mean = my_mean_function(my_array)
outliers = (my_array > (mean + 2 * sd)) | (my_array < (mean - 2 * sd))
my_array[outliers] = np.nan  # NumPy has no NA value; np.nan requires a float array
But consider: None?
I don't understand the structure of your array well enough to write the appropriate functions. Perhaps these would satisfy your needs:
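from math import sqrt

def my_mean_function(arr):
    # Arithmetic mean over every element of the array
    return arr.sum() / arr.size

def my_sd_function(arr):
    # Population standard deviation (divides by N, not N - 1)
    mean = my_mean_function(arr)
    sqrerr = ((arr - mean) ** 2).sum() / arr.size
    return sqrt(sqrerr)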
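Putting the pieces together, here is a minimal sketch that uses NumPy's built-in mean() and std() methods in place of the hand-written helpers. The name mask_outliers and its n_sd parameter are my own invention for illustration, and I am assuming a single mean and SD over the whole array is what you want.

```python
import numpy as np

def mask_outliers(arr, n_sd=2.0):
    """Return a float copy of arr with values beyond n_sd SDs of the mean set to NaN."""
    # np.array copies by default, so the caller's array is left untouched;
    # the float dtype is needed because NaN cannot be stored in an integer array
    out = np.array(arr, dtype=float)
    mean = out.mean()
    sd = out.std()  # population SD (ddof=0), same as my_sd_function
    outliers = (out > mean + n_sd * sd) | (out < mean - n_sd * sd)
    out[outliers] = np.nan
    return out

my_array = np.array([[100, 110, 0.01, 110],
                     [120, 100, 112, 100],
                     [4000, 100, 200, 100]])
cleaned = mask_outliers(my_array)
print(cleaned)
```

Note that with this particular data only 4000 gets replaced. The 4000 inflates the overall SD so much that 0.01 still falls within 2 SDs of the mean, so a global 2-SD rule may not give exactly the output shown in the question.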
The core thing to understand here is how to select and update array elements based on a condition.
In such conditions you use &, ~ and | instead of the keywords and, not and or. NumPy arrays overload these operators to work element by element, whereas and, or and not are language constructs rather than operators, so they cannot be overloaded.
Such expressions return boolean arrays that can be treated like any other array (print outliers in a console / IPython session to see what I mean).
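To make the operator point concrete, here is a small sketch with made-up values (the array a and the variable names are just for illustration). The parentheses around each comparison matter, because & and | bind more tightly than < and >.

```python
import numpy as np

a = np.array([1, 5, 10, 50])

too_small = a < 3    # boolean array, one entry per element of a
too_big = a > 20

either = too_small | too_big       # element-wise "or"
neither = ~either                  # element-wise "not"
in_range = (a >= 3) & (a <= 20)    # element-wise "and"

# The keyword form needs one truth value for the whole array, which is ambiguous
try:
    too_small or too_big
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(either, neither, in_range, keyword_failed)
```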
The second part is that my_array[...] accepts more than a single index: you can pass a list of indexes, a slice, or a boolean mask like outliers, and retrieve or alter the selected elements efficiently. Assigning through such an index changes my_array in place. When reading, a slice gives a view onto the underlying data, whereas a boolean mask or a list of indexes gives a copy.
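If you want to check for yourself which kinds of indexing share memory with the original array, something like the following sketch should do. The values are made up; np.shares_memory is the NumPy function that reports whether two arrays overlap in memory.

```python
import numpy as np

data = np.array([1.0, 2.0, 300.0, 4.0])
mask = data > 100

# Assigning through a boolean mask modifies data in place
data[mask] = np.nan

# Reading through a boolean mask produces a copy
picked = data[~mask]
picked[0] = -1.0      # data[0] is unaffected

# Reading through a slice produces a view
window = data[:2]
window[1] = 20.0      # data[1] changes as well

print(data)
print(np.shares_memory(picked, data), np.shares_memory(window, data))
```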