Search code examples
sqloracle-databaseaveragesql-navigator

Oracle SQL Query to calculate the average of a data set, excluding outliers


I have a query which contains the correct conditions and fields that i want to display:

  SELECT t.business_process_id,
         COUNT (tsp.status) AS COUNT,
         ROUND (AVG (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS average,
         ROUND (MAX (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS MAX,
         ROUND (MIN (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS MIN,
         ROUND (MEDIAN (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS MEDIAN,
         ROUND (STDDEV (tsp.end_date - tsp.start_date), 2) AS std_deviation
    FROM transaction_status_period tsp, transaction t
   WHERE     t.trans_id = tsp.trans_id
         AND tsp.status = 'R'
         AND tsp.end_date IS NOT NULL
         AND tsp.userid NOT IN ('X', 'Y', 'Z', 'A')
         AND EXTRACT (DAY FROM tsp.start_date) =
                 EXTRACT (DAY FROM tsp.end_date)
         AND EXTRACT (YEAR FROM tsp.start_date) =
                 EXTRACT (YEAR FROM tsp.end_date)
         AND EXTRACT (MONTH FROM tsp.start_date) =
                 EXTRACT (MONTH FROM tsp.end_date)
         AND EXTRACT (YEAR FROM tsp.start_date) = 2013
         AND NOT EXISTS
                     (SELECT 1
                        FROM transaction_status_period tsp1
                       WHERE     tsp1.trans_id = tsp.trans_id
                             AND tsp.userid = tsp1.userid
                             AND tsp1.status = 'S'
                             AND tsp1.timestamp < tsp.timestamp)
GROUP BY t.business_process_id

The average calculated by the query is of the whole data set in question (for the year = 2013). Is there a way to get the query to calculate the average of all data from 2013 excluding outliers? I.e. find the average of the date difference of (tsp.end_date - tsp.start_date) where, the majority of observations fall, for the year 2013?

Would the percentile_cont function work? I am not familiar with it but i do know it calculates the percentile of a particular column. In my case, I'm looking for the average date difference between (tsp.end_date - tsp.start_date), but the average of majority of the data points (excluding outliers).

Any help would be much appreciated. Perhaps I am tackling this query in the wrong way.


Solution

  • Would something like this solve your problem?

    Calculate your average and standard deviation in an inline view and then use that to define your outliers. Assuming you consider an outlier twice the standard of the mean then:

    SELECT calc.business_process_id,
     COUNT (calc.status) AS COUNT,
     ROUND (AVG (calc.end_date - calc.start_date), 2) * 24 * 60 AS average,
     ROUND (MAX (calc.end_date - calc.start_date), 2) * 24 * 60 AS MAX,
     ROUND (MIN (calc.end_date - calc.start_date), 2) * 24 * 60 AS MIN,
     ROUND (MEDIAN (calc.end_date - calc.start_date), 2) * 24 * 60 AS MEDIAN,
     ROUND (STDDEV (calc.end_date - calc.start_date), 2) AS std_deviation
    FROM (SELECT t.business_process_id,
             tsp.status,
             tsp.start_date,
             tsp.end_date, 
             ntile(100) over (order by (tsp.end_date-tsp.start_date)) as percentiles
          FROM transaction_status_period tsp, transaction t 
          WHERE     t.trans_id = tsp.trans_id
          AND tsp.status = 'R'
          AND tsp.end_date IS NOT NULL
          AND tsp.userid NOT IN ('X', 'Y', 'Z', 'A')
          AND EXTRACT (DAY FROM tsp.start_date) =
              EXTRACT (DAY FROM tsp.end_date)
          AND EXTRACT (YEAR FROM tsp.start_date) =
              EXTRACT (YEAR FROM tsp.end_date)
          AND EXTRACT (MONTH FROM tsp.start_date) =
              EXTRACT (MONTH FROM tsp.end_date)
          AND EXTRACT (YEAR FROM tsp.start_date) = 2013
          AND NOT EXISTS
                 (SELECT 1
                    FROM transaction_status_period tsp1
                   WHERE     tsp1.trans_id = tsp.trans_id
                         AND tsp.userid = tsp1.userid
                         AND tsp1.status = 'S'
                         AND tsp1.timestamp < tsp.timestamp)
      ) calc
    WHERE calc.percentiles >=10 
    AND calc.percentiles <=90
    GROUP BY calc.business_process_id