Search code examples
sqlstatisticsdeviation

filter out deviating record with sql


We have this set of data that we need to get the average of a column. a select avg(x) from y does the trick. However we need a more accurate figure.

I figured that there must be a way of filtering records that has either too high or too low values(spikes) so that we can exclude them in calculating the average.


Solution

  • There are three types of average, and what you are originally using is the mean - the sum of all the values divided by the number of values.

    You might find it more useful to get the mode - the most frequently occuring value:

    select name,    
           (select top 1 h.run_duration
            from sysjobhistory h
            where h.step_id = 0
            and h.job_id = j.job_id
            group by h.run_duration
            order by count(*) desc) run_duration
    from sysjobs j
    

    If you did want to get rid of any values outside the original standard deviation, you could find the average and the standard deviation in a subquery, eliminate those values which are outside the range : average +- standard deviation, then do a further average of the remaining values, but you start running the risk of having meaningless values:

    select oh.job_id, avg(oh.run_duration) from sysjobhistory oh
    inner join (select job_id, avg(h.run_duration) avgduration, 
                stdev(h.run_duration) stdev_duration 
                from sysjobhistory h 
                group by job_id) as m on m.job_id = oh.job_id
    where oh.step_id = 0
    and abs(oh.run_duration - m.avgduration) <  m.stdev_duration
    group by oh.job_id