Search code examples
sql-serversql-server-2008ssasdata-miningbusiness-intelligence

Build Groups Using Related Data Points


I'm trying to find a way to develop groups of things based on data points that are related to the individual 'things' but also to each other.

For instance, let's say I'm trying to group junk mail that I receive at home. When I receive the letter, I record:

  • The postmark city and state
  • The return address street, business name, city, state and zip
  • The phone number of the business
  • The size and color of the envelope
  • The font used when printing the envelope

Let's also say that, over time, I see that that same return address and/or phone number shows up with different business names. I can infer that all of these letters were probably sent by the same business. I can associate the addresses, phone numbers and and business names together as the same 'entity'.

Or, I see completely different addresses and phones numbers, but the size, color, postmark and font on the envelope are all the same. I can infer (with less certainty) that these may also be from the same business.

What I'm looking for is the best way to take data of this sort and group it into 'buckets' (entities) based on the overlapping data... using SQL Server, Analysis Services or some combination thereof... so that I end up with a way to input a single data point to see if it's related to any others (e.g. enter a phone number from a letter to see the entity, or group of letters, it's related to).

Can someone please point me in the right direction?

Thanks in advance!


Solution

  • For each record I would ask, "how unique is this?" Then start breaking it down logically based upon that...

    State: Low unique dataset with potential for mass duplication; create a [State] table with an identity column and if possible pre-populate it with all possible values to reduce index fragmentation on the non-clustered index.

    Create  Table [dbo].[State] (StateID Int Identity, StateName Varchar(32))
    Create  Unique Clustered Index ix_stateID On [dbo].[State] (StateID)
    Create  Unique NonClustered Index ix_SN On [dbo].[State] (StateName)
    

    ZipCode: Medium unique dataset with decent chance for duplication, but every ZipCode is associated to a single state. Once again, prepopulating this one could be useful to avoid progressive index fragmentation, but depending on how fast you're expecting this to grow it might make sense to just let it fill up as it will and re-index periodically. If you're only going to track US addresses and only the first five digits pre-populating would be fine (and change the ZipCode column to an Int if you're doing this).

    Create  Table [dbo].[ZipCode] (ZipCodeID Int Identity, StateID Int, ZipCode Varchar(16))
    Create  Unique Clustered Index ix_zipcodeID On [dbo].[ZipCode] (ZipCodeID)
    Create  Unique NonClustered Index ix_stateID_ZC On [dbo].[ZipCode] (StateID, ZipCode)
    

    City: This table will have a decently large dataset, but still has the opportunity for tons of duplication so we'll once again create an identity value, but this time I would definitely not prepopulate.

    Create  Table [dbo].[City] (CityID Int Identity, ZipCodeID Int, CityName Varchar(64))
    Create  Unique Clustered Index ix_cityID On [dbo].[City] (CityID)
    Create  Unique NonClustered Index ix_zipcodeID_C On [dbo].[City] (ZipCodeID, City)
    

    StreetAddress: This is as selective as we're going to get with out addresses, but we still want to create an ID column since we can receive tons of mail from the same address.

    Create  Table [dbo].[StreetAddress] (StreetAddressID Int Identity, CityID Int, StreetAddress Varchar(256))
    Create  Unique Clustered Index ix_streetaddressID On [dbo].[StreetAddress] (StreetAddressID)
    Create  Unique NonClustered Index ix_cityID_SA On [dbo].[StreetAddress] (CityID, StreetAddress)
    

    For the phone number I would probably break it down by [AreaCode] and [PhoneNumber] so...

    Create  Table [dbo].[AreaCode] (AreaCodeID Int Identity, AreaCode Int)
    Create  Unique Clustered Index ix_areacodeID On [dbo].[AreaCode] (AreaCodeID)
    Create  Unique NonClustered Index ix_AC On [dbo].[AreaCode] (AreaCode)
    
    Create  Table [dbo].[PhoneNumber] (PhoneNumberID Int Identity, AreaCodeID Int, PhoneNumber Int)
    Create  Unique Clustered Index ix_phonenumberID On [dbo].[PhoneNumber] (PhoneNumberID)
    Create  Unique NonClustered Index ix_acID_PN On [dbo].[PhoneNumber] (AreaCodeID, PhoneNumber)
    

    Then for the rest I would create single depth look up tables (Size,Color,Font,etc)

    Create  Table [dbo].[Characteristic] (CharacteristicID Int Identity, Characteristic AppropriateDataType)
    Create  Unique Clustered Index ix_characteristicID On [dbo].[Characteristic] (CharacteristicID)
    Create  Unique NonClustered Index ix_abrevCharact On [dbo].[Characteristic] (Characteristic)
    

    Then finally have your most unique item, your mail...

    Create  Table [dbo].[Letter] (LetterID Int Identity, Received DateTime, StreetAddressID Int, PhoneNumberID Int, CharacteristicIDs ...)
    

    Figure out what indexes make sense on your [dbo].[Letter] table based upon what queries you run most often and efficient querying should be as simple as writing the appropriate queries with the necessary joins and logic. :)

    That's my 2 cents.