Search code examples
sql-serverrelationshipregion

Find "regional" relationships in SQL data using a query, or SSIS


Edit for clarification: I am compiling data weekly, based on Zip_Code, but some Zip_Codes are redundant. I know I should be able to compile a small amount of data, and derive the redundant zip_codes if I can establish relationships.

I want to define a zip code's region by the unique set of items and values that appear in that zip code, in order to create a "Region Table"

I am looking to find relationships by zip code with certain data. Ultimately, I have tables which include similar values for many zip codes.

I have data similar to:

ItemCode   |Value  | Zip_Code
-----------|-------|-------
1          |10     | 1
2          |15     | 1
3          |5      | 1 
1          |10     | 2
2          |15     | 2
3          |5      | 2
1          |10     | 3
2          |10     | 3
3          |15     | 3

Or to simplify the idea, I could even concantenate ItemCode + Value into unique values:

ItemCode+
Value    | Zip_Code
A        | 1
B        | 1
C        | 1
A        | 2
B        | 2
C        | 2
A        | 3
D        | 3
E        | 3    

As you can see, Zip_Code 1 and 2 have the same distinct ItemCode and Value. Zip_Code 3 however, has different values for certain ItemCodes.

I need to create a table that establishes a relationship between Zip_Codes that contain the same data.

The final table will look something like:

Zip_Code | Region
1        | 1
2        | 1
3        | 2
4        | 2
5        | 1
6        | 3
...etc

This will allow me to collect data only once for each unique Region, and derive the zip_code appropriately.

Things I'm doing now:

I am currently using a query similar to a join, and compares against Zip_Code using something along the lines of:

SELECT a.ItemCode
      ,a.value
      ,a.zip_code
      ,b.ItemCode
      ,b.value
      ,b.zip_code
FROM mytable as a, mytable as b -- select from table twice, similar to a join
WHERE a.zip_code = 1  -- left table will have all ItemCode and Value from zip 1
  AND b.zip_code = 2  -- right table will have all ItemCode and Value from zip 2
  AND a.ItemCode = b.ItemCode -- matches rows on ItemCode 
  AND a.Value != b.Value
ORDER BY ItemCode

This returns nothing if the two zip codes have exactly the same ItemNum, and Value, and returns a slew of differences between the two zip codes if there are differences.

This needs to move from a manual process to an automated process however, as I am now working with more than 100 zip_codes.

I do not have much programming experience in specific languages, so tools in SSIS are somewhat limited to me. I have some experience using the Fuzzy tools, and feel like there might be something in Fuzzy Grouping that might shine a light on apparent regions, but can't figure out how to set it up.

Does anyone have any suggestions? I have access to SQLServ and its related tools, and Visual Studio. I am trying to avoid writing a program to automate this, as my c# skills are relatively nooby, but will figure it out if necessary.

Sorry for being so verbose: This is my first Question, and the page I agreed to in order to ask a question suggested to explain in detail, and talk about what I've tried... Thanks in advance for any help I might receive.


Solution

  • Give this a shot (I used the simplified example, but this can easily be expanded). I think the real interesting part of this code is the recursive CTE...

    ;with matches as (
        --Find all pairs of zip_codes that have matching values.
        select d1.ZipCode zc1, d2.ZipCode zc2
        from data d1
            join data d2 on d1.Val=d2.Val
        group by d1.ZipCode, d2.ZipCode
        having count(*) = (select count(distinct Val) from data where zipcode = d1.Zipcode)
    ), cte as (
        --Trace each zip_code to it's "smallest" matching zip_code id.
        select zc1 tempRegionID, zc2 ZipCode
        from matches
        where zc1<=zc2
        UNION ALL
        select c.tempRegionID, m.zc2
        from cte c
            join matches m on c.ZipCode=m.zc1
                          and c.ZipCode!=m.zc2
        where m.zc1<=m.zc2
    )
    --For each zip_code, use it's smallest matching zip_code as it's region.
    select zipCode, min(tempRegionID) as regionID
    from cte
    group by ZipCode