I have to present some data to work colleagues and i am having issues analysing it in MySQL.
I have 1 table called 'payments'. Each payment has columns for:
I have written pretty simple queries like:
SELECT
AVG(amount_GBP),
COUNT(client) AS '#Of Results'
FROM payments
WHERE client_type = 'individual'
AND amount_gbp IS NOT NULL
AND currency = 'TRY'
AND country_origin = 'GB'
AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
But what i really need to do is eliminate outliers from the average AND/OR only include results within a number of Standard Deviations from the Mean.
For example, ignore the top/bottom 10 results of 2% of results etc. AND/OR ignore any results that fall outside of 2 STDEVs from the Mean
Can anyone help?
--- EDITED ANSWER -- TRY AND LET ME KNOW ---
Your best best is to create a TEMPORARY table with the avg and std_dev values and compare against them. Let me know if that is not feasible:
CREATE TEMPORARY TABLE payment_stats AS
SELECT
AVG(p.amount_gbp) as avg_gbp,
STDDEV(amount_gbp) as std_gbp,
(SELECT MIN(srt.amount_gbp) as max_gbp
FROM (SELECT amount_gbp
FROM payments
<... repeat where no p. ...>
ORDER BY amount_gbp DESC
LIMIT <top_numbers to ignore>
) srt
) max_g,
(SELECT MAX(srt.amount_gbp) as min_gbp
FROM (SELECT amount_gbp
FROM payments
<... repeat where no p. ...>
ORDER BY amount_gbp ASC
LIMIT <top_numbers to ignore>
) srt
) min_g
FROM payments
WHERE client_type = 'individual'
AND amount_gbp IS NOT NULL
AND currency = 'TRY'
AND country_origin = 'GB'
AND date_time BETWEEN '2017/1/1' AND '2017/9/1';
You can then compare against the temp table
SELECT
AVG(p.amount_gbp) as avg_gbp,
COUNT(p.client) AS '#Of Results'
FROM payments p
WHERE
p.amount_gbp >= (SELECT (avg_gbp - std_gbp*2)
FROM payment_stats)
AND p.amount_gbp <= (SELECT (avg_gbp + std_gbp*2)
FROM payment_stats)
AND p.amount_gbp > (SELECT min_g FROM payment_stats)
AND p.amount_gbp < (SELECT max_g FROM payment_stats)
AND p.client_type = 'individual'
AND p.amount_gbp IS NOT NULL
AND p.currency = 'TRY'
AND p.country_origin = 'GB'
AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';
-- Later on
DROP TEMPORARY TABLE payment_stats;
Notice I had to repeat the WHERE condition. Also change *2
to whatever <factor>
to what you need!
Still Phew!
Each compare will check a different stat
Let me know if this is better