Search code examples
sql-serverstandard-deviation

Google Maps SQL Server : calculating outlier geographic data within group


There are 100 suppliers, each with between 50 and 1000 items. Each supplier may have items close to their office or spread across an entire country or continent.

As LatLngs are input by a human, some mistakes happen. With lots of data and constant 'churn', mistakes are difficult to identify.

To improve data quality, I want to identify outliers for each supplier so that they can be fixed. If a supplier's items are mostly near New York, one in California would be an outlier.

SUPPLIERS

SupplierID int
Latitude DECIMAL(12,9)
Longitude DECIMAL(12,9)

ITEMS

ItemID int
SupplierID int
LatLng geography

I assume I need to use standard deviation for this, but putting it into T-SQL is giving me a headache.

I'd like to output a list of outliers for each supplier, based on each supplier's specific deviation.

This code outputs Items and the distance between each item and the supplier's office.

WITH cte AS 
(
    SELECT 
        ItemID,
        SupplierID,
        LatLng,
        LatLng.STDistance(GEOGRAPHY::Point(a.Latitude, a.Longitude, 4326))/1000 As Distance
    FROM 
        Items v
    JOIN 
        Suppliers a ON v.SupplierID = a.SupplierID
)
SELECT 
    ItemID, SupplierID, Distance 
FROM cte

Here's the SQL functionality for standard deviation (from a blog post):

DECLARE @StdDev DECIMAL(5,2) 
DECLARE @Avg DECIMAL(5,2)

SELECT 
   @StdDev = STDEV(Qty), 
   @Avg = AVG(Qty) 
FROM Sales

SELECT 
   * 
FROM 
   Sales 
WHERE 
   Qty > @Avg - @StdDev AND
   Qty < @Avg + @StdDev

STEPS I NEED TO DO

  1. Calculate STDEV and AVG for distance, GROUP BY SupplierID
  2. Output items where the distance is greater than AVG + STDEV for the item's supplier

This is where I'm scratching my head as this is multiple steps AFTER the multiple steps I've already performed. I guess I could insert what I have into a TEMP table and go from there, but is that really the best way?


Solution

  • You can use window functions for this. Both AVG and STDEV are available as window functions

    WITH Distances AS 
    (
        SELECT 
            i.ItemID,
            s.SupplierID,
            i.LatLng,
            v.SupplierLocation,
            i.LatLng.STDistance(v.SupplierLocation)/1000 As Distance
        FROM 
            Items i
        JOIN 
            Suppliers s ON i.SupplierID = s.SupplierID
        CROSS APPLY (VALUES (
            GEOGRAPHY::Point(s.Latitude, s.Longitude, 4326)
        )) v(SupplierLocation)
    ),
    Averages AS (
        SELECT
            ItemID,
            SupplierID,
            LatLng,
            SupplierLocation
            Distance,
            AVG(Distance)   OVER (PARTITION BY SupplierID) AS Avg,
            STDEV(Distance) OVER (PARTITION BY SupplierID) AS StDev
        FROM
            Distances
    )
    SELECT 
        ItemID,
        SupplierID,
        Distance,
        Avg,
        StDev
    FROM
        Averages
    WHERE 
        Distance > Avg - StdDev AND
        Distance < Avg + StdDev;