sql machine-learning sql-server-2008-r2 data-cleaning

How to detect outlier data points on my database

I currently have a database that has some has mislabeled prices on it.

Example Data:

Product - Price - SalesDate
ProdA  - 10  - 1/1/2016
ProdB  - 20  - 1/2/2016
ProdA  - 100 - 1/3/2016
ProdB  - 20  - 1/4/2016
ProdB  - 21  - 1/5/2016
ProdA  - 11  - 1/6/2016

On this data set the record "ProdA - 100 - 1/3/2016" is the one with an error. Must probably the person who entered the price made a typing mistake. Furthermore, ProdA on different days can changes it price and that makes this problem interesting.

What kind of tool can help my identify this type of records? Can SQL help me detect Outlier data points? Should I start looking into Machine Learning for this?

Solution

This is a bit subjective, but you can identify the rows whose values are furthest from the average. I would do this by calculating the z-score and looking at the largest/smallest z-scores.

The z-score is the value minus the average divided by the standard deviation. Here is an example of the calculation:

select t.*,
       (price - avg_price) / nullif(std_price, 0) as z_price
from t join
     (select product, avg(price) as avg_price, stdev(price) as std_price
      from t
      group by product
     ) tt
     on t.product = tt.product
order by abs(z_price) desc;

The function for the standard deviation might vary, depending on the database you are using, but most databases support such a function.