I'm trying to find a way to develop groups of things based on data points that are related to the individual 'things' but also to each other.
For instance, let's say I'm trying to group junk mail that I receive at home. When I receive the letter, I record:
Let's also say that, over time, I see that that same return address and/or phone number shows up with different business names. I can infer that all of these letters were probably sent by the same business. I can associate the addresses, phone numbers and and business names together as the same 'entity'.
Or, I see completely different addresses and phones numbers, but the size, color, postmark and font on the envelope are all the same. I can infer (with less certainty) that these may also be from the same business.
What I'm looking for is the best way to take data of this sort and group it into 'buckets' (entities) based on the overlapping data... using SQL Server, Analysis Services or some combination thereof... so that I end up with a way to input a single data point to see if it's related to any others (e.g. enter a phone number from a letter to see the entity, or group of letters, it's related to).
Can someone please point me in the right direction?
Thanks in advance!
For each record I would ask, "how unique is this?" Then start breaking it down logically based upon that...
State: Low unique dataset with potential for mass duplication; create a [State] table with an identity column and if possible pre-populate it with all possible values to reduce index fragmentation on the non-clustered index.
Create Table [dbo].[State] (StateID Int Identity, StateName Varchar(32))
Create Unique Clustered Index ix_stateID On [dbo].[State] (StateID)
Create Unique NonClustered Index ix_SN On [dbo].[State] (StateName)
ZipCode: Medium unique dataset with decent chance for duplication, but every ZipCode is associated to a single state. Once again, prepopulating this one could be useful to avoid progressive index fragmentation, but depending on how fast you're expecting this to grow it might make sense to just let it fill up as it will and re-index periodically. If you're only going to track US addresses and only the first five digits pre-populating would be fine (and change the ZipCode column to an Int if you're doing this).
Create Table [dbo].[ZipCode] (ZipCodeID Int Identity, StateID Int, ZipCode Varchar(16))
Create Unique Clustered Index ix_zipcodeID On [dbo].[ZipCode] (ZipCodeID)
Create Unique NonClustered Index ix_stateID_ZC On [dbo].[ZipCode] (StateID, ZipCode)
City: This table will have a decently large dataset, but still has the opportunity for tons of duplication so we'll once again create an identity value, but this time I would definitely not prepopulate.
Create Table [dbo].[City] (CityID Int Identity, ZipCodeID Int, CityName Varchar(64))
Create Unique Clustered Index ix_cityID On [dbo].[City] (CityID)
Create Unique NonClustered Index ix_zipcodeID_C On [dbo].[City] (ZipCodeID, City)
StreetAddress: This is as selective as we're going to get with out addresses, but we still want to create an ID column since we can receive tons of mail from the same address.
Create Table [dbo].[StreetAddress] (StreetAddressID Int Identity, CityID Int, StreetAddress Varchar(256))
Create Unique Clustered Index ix_streetaddressID On [dbo].[StreetAddress] (StreetAddressID)
Create Unique NonClustered Index ix_cityID_SA On [dbo].[StreetAddress] (CityID, StreetAddress)
For the phone number I would probably break it down by [AreaCode] and [PhoneNumber] so...
Create Table [dbo].[AreaCode] (AreaCodeID Int Identity, AreaCode Int)
Create Unique Clustered Index ix_areacodeID On [dbo].[AreaCode] (AreaCodeID)
Create Unique NonClustered Index ix_AC On [dbo].[AreaCode] (AreaCode)
Create Table [dbo].[PhoneNumber] (PhoneNumberID Int Identity, AreaCodeID Int, PhoneNumber Int)
Create Unique Clustered Index ix_phonenumberID On [dbo].[PhoneNumber] (PhoneNumberID)
Create Unique NonClustered Index ix_acID_PN On [dbo].[PhoneNumber] (AreaCodeID, PhoneNumber)
Then for the rest I would create single depth look up tables (Size,Color,Font,etc)
Create Table [dbo].[Characteristic] (CharacteristicID Int Identity, Characteristic AppropriateDataType)
Create Unique Clustered Index ix_characteristicID On [dbo].[Characteristic] (CharacteristicID)
Create Unique NonClustered Index ix_abrevCharact On [dbo].[Characteristic] (Characteristic)
Then finally have your most unique item, your mail...
Create Table [dbo].[Letter] (LetterID Int Identity, Received DateTime, StreetAddressID Int, PhoneNumberID Int, CharacteristicIDs ...)
Figure out what indexes make sense on your [dbo].[Letter] table based upon what queries you run most often and efficient querying should be as simple as writing the appropriate queries with the necessary joins and logic. :)
That's my 2 cents.