I have a query which contains the correct conditions and fields that i want to display:
SELECT t.business_process_id,
COUNT (tsp.status) AS COUNT,
ROUND (AVG (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS average,
ROUND (MAX (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS MAX,
ROUND (MIN (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS MIN,
ROUND (MEDIAN (tsp.end_date - tsp.start_date), 2) * 24 * 60 AS MEDIAN,
ROUND (STDDEV (tsp.end_date - tsp.start_date), 2) AS std_deviation
FROM transaction_status_period tsp, transaction t
WHERE t.trans_id = tsp.trans_id
AND tsp.status = 'R'
AND tsp.end_date IS NOT NULL
AND tsp.userid NOT IN ('X', 'Y', 'Z', 'A')
AND EXTRACT (DAY FROM tsp.start_date) =
EXTRACT (DAY FROM tsp.end_date)
AND EXTRACT (YEAR FROM tsp.start_date) =
EXTRACT (YEAR FROM tsp.end_date)
AND EXTRACT (MONTH FROM tsp.start_date) =
EXTRACT (MONTH FROM tsp.end_date)
AND EXTRACT (YEAR FROM tsp.start_date) = 2013
AND NOT EXISTS
(SELECT 1
FROM transaction_status_period tsp1
WHERE tsp1.trans_id = tsp.trans_id
AND tsp.userid = tsp1.userid
AND tsp1.status = 'S'
AND tsp1.timestamp < tsp.timestamp)
GROUP BY t.business_process_id
The average calculated by the query is of the whole data set in question (for the year = 2013). Is there a way to get the query to calculate the average of all data from 2013 excluding outliers? I.e. find the average of the date difference of (tsp.end_date - tsp.start_date)
where, the majority of observations fall, for the year 2013?
Would the percentile_cont
function work? I am not familiar with it but i do know it calculates the percentile of a particular column. In my case, I'm looking for the average date difference between (tsp.end_date - tsp.start_date)
, but the average of majority of the data points (excluding outliers).
Any help would be much appreciated. Perhaps I am tackling this query in the wrong way.
Would something like this solve your problem?
Calculate your average and standard deviation in an inline view and then use that to define your outliers. Assuming you consider an outlier twice the standard of the mean then:
SELECT calc.business_process_id,
COUNT (calc.status) AS COUNT,
ROUND (AVG (calc.end_date - calc.start_date), 2) * 24 * 60 AS average,
ROUND (MAX (calc.end_date - calc.start_date), 2) * 24 * 60 AS MAX,
ROUND (MIN (calc.end_date - calc.start_date), 2) * 24 * 60 AS MIN,
ROUND (MEDIAN (calc.end_date - calc.start_date), 2) * 24 * 60 AS MEDIAN,
ROUND (STDDEV (calc.end_date - calc.start_date), 2) AS std_deviation
FROM (SELECT t.business_process_id,
tsp.status,
tsp.start_date,
tsp.end_date,
ntile(100) over (order by (tsp.end_date-tsp.start_date)) as percentiles
FROM transaction_status_period tsp, transaction t
WHERE t.trans_id = tsp.trans_id
AND tsp.status = 'R'
AND tsp.end_date IS NOT NULL
AND tsp.userid NOT IN ('X', 'Y', 'Z', 'A')
AND EXTRACT (DAY FROM tsp.start_date) =
EXTRACT (DAY FROM tsp.end_date)
AND EXTRACT (YEAR FROM tsp.start_date) =
EXTRACT (YEAR FROM tsp.end_date)
AND EXTRACT (MONTH FROM tsp.start_date) =
EXTRACT (MONTH FROM tsp.end_date)
AND EXTRACT (YEAR FROM tsp.start_date) = 2013
AND NOT EXISTS
(SELECT 1
FROM transaction_status_period tsp1
WHERE tsp1.trans_id = tsp.trans_id
AND tsp.userid = tsp1.userid
AND tsp1.status = 'S'
AND tsp1.timestamp < tsp.timestamp)
) calc
WHERE calc.percentiles >=10
AND calc.percentiles <=90
GROUP BY calc.business_process_id