There are 100 suppliers, each with between 50 and 1000 items. Each supplier may have items close to their office or spread across an entire country or continent.
As LatLngs are input by a human, some mistakes happen. With lots of data and constant 'churn', mistakes are difficult to identify.
To improve data quality, I want to identify outliers for each supplier so that they can be fixed. If a supplier's items are mostly near New York, one in California would be an outlier.
SUPPLIERS
SupplierID int
Latitude DECIMAL(12,9)
Longitude DECIMAL(12,9)
ITEMS
ItemID int
SupplierID int
LatLng geography
I assume I need to use standard deviation for this, but putting it into T-SQL is giving me a headache.
I'd like to output a list of outliers for each supplier, based on each supplier's specific deviation.
This code outputs Items and the distance between each item and the supplier's office.
WITH cte AS
(
SELECT
ItemID,
SupplierID,
LatLng,
LatLng.STDistance(GEOGRAPHY::Point(a.Latitude, a.Longitude, 4326))/1000 As Distance
FROM
Items v
JOIN
Suppliers a ON v.SupplierID = a.SupplierID
)
SELECT
ItemID, SupplierID, Distance
FROM cte
Here's the SQL functionality for standard deviation (from a blog post):
DECLARE @StdDev DECIMAL(5,2)
DECLARE @Avg DECIMAL(5,2)
SELECT
@StdDev = STDEV(Qty),
@Avg = AVG(Qty)
FROM Sales
SELECT
*
FROM
Sales
WHERE
Qty > @Avg - @StdDev AND
Qty < @Avg + @StdDev
STEPS I NEED TO DO
STDEV
and AVG
for distance, GROUP BY SupplierID
AVG + STDEV
for the item's supplierThis is where I'm scratching my head as this is multiple steps AFTER the multiple steps I've already performed. I guess I could insert what I have into a TEMP table and go from there, but is that really the best way?
You can use window functions for this. Both AVG
and STDEV
are available as window functions
WITH Distances AS
(
SELECT
i.ItemID,
s.SupplierID,
i.LatLng,
v.SupplierLocation,
i.LatLng.STDistance(v.SupplierLocation)/1000 As Distance
FROM
Items i
JOIN
Suppliers s ON i.SupplierID = s.SupplierID
CROSS APPLY (VALUES (
GEOGRAPHY::Point(s.Latitude, s.Longitude, 4326)
)) v(SupplierLocation)
),
Averages AS (
SELECT
ItemID,
SupplierID,
LatLng,
SupplierLocation
Distance,
AVG(Distance) OVER (PARTITION BY SupplierID) AS Avg,
STDEV(Distance) OVER (PARTITION BY SupplierID) AS StDev
FROM
Distances
)
SELECT
ItemID,
SupplierID,
Distance,
Avg,
StDev
FROM
Averages
WHERE
Distance > Avg - StdDev AND
Distance < Avg + StdDev;