The sample data is like this.
I want select outliers out of 4 sigma for each class. I tried
select value,class,AVG(value) as mean, STDEV(value)as st, size from Data
having value<mean-2*st OR value>mean+2*st group by calss
it seems does not work. Should I use having or where clause here?
The results I want is the whole 3rd row and 8th row.
When the condition you are looking at is a property of the row, use where
i.e. where class = 1
(all rows with class 1) or where size > 2
(all rows with size > 2). When the condition is a property of a set of rows you use group by ... having
e.g. group by class having avg(value) > 2
(all classes with average value > 2).
In this case you want where
but there is a complication. You don't have enough information in each row alone to write the necessary where
clause, so you will have to get it through a subquery.
Ultimately you want something like SELECT value, class, size FROM Data WHERE value < mean - 2 *st OR value > mean + 2*st;
however you need a subquery to get mean
and st
.
One way to do this is:
SELECT value, Data.class, size, mean, st FROM Data,
INNER JOIN (
SELECT class, AVG(value) AS mean, STDEV(value) AS st
FROM Data GROUP BY class
) AS stats ON stats.class = Data.class
WHERE value < mean - 2 * st OR value > mean + 2 * st;
This creates a subquery which gets your means and standard deviations for each class, joins those numbers to the rows with matching classes, and then applies your outlier check.