I need to create a concatenated column based on two other columns in the table. One column is the year (10 distinct values), and one is a person's ID value (~150,000 distinct values). This is being used as a business key by an ETL task that will not accept multiple columns as the key value, so I need to persist this value in my database.
That said, I can choose how to create that value, and I'm wondering if the order (ID + Year or Year + ID) affects performance in any way. If the year goes first, the first four characters will always be one of a limited set of actual years. If the ID goes first, that will change for each user. Is there any difference between the two?
The only usage this column will get is during the ETL load, where it will be used to join data from the source and staging tables to check for differences between the two. The base values will be in the table underneath, and I plan on creating a clustered index on those base values.
Sample Data:
╔══════════════╦══════════════╦═════════╦═════════╗
║ COMPOSITE_1 ║ COMPOSITE_2 ║ AC_YEAR ║ ST_ID ║
╠══════════════╬══════════════╬═════════╬═════════╣
║ 0000001|2005 ║ 2005|0000001 ║ 2005 ║ 0000001 ║
║ 0000001|2006 ║ 2006|0000001 ║ 2006 ║ 0000001 ║
║ 0000001|2009 ║ 2009|0000001 ║ 2009 ║ 0000001 ║
║ 0000001|2010 ║ 2010|0000001 ║ 2010 ║ 0000001 ║
║ 0000001|2012 ║ 2012|0000001 ║ 2012 ║ 0000001 ║
║ 0000001|2013 ║ 2013|0000001 ║ 2013 ║ 0000001 ║
║ 0000002|2005 ║ 2005|0000002 ║ 2005 ║ 0000002 ║
║ 0000002|2006 ║ 2006|0000002 ║ 2006 ║ 0000002 ║
║ 0000002|2007 ║ 2007|0000002 ║ 2007 ║ 0000002 ║
║ 0000002|2008 ║ 2008|0000002 ║ 2008 ║ 0000002 ║
║ 0000002|2009 ║ 2009|0000002 ║ 2009 ║ 0000002 ║
║ 0000002|2010 ║ 2010|0000002 ║ 2010 ║ 0000002 ║
║ 0000002|2012 ║ 2012|0000002 ║ 2012 ║ 0000002 ║
║ 0000002|2013 ║ 2013|0000002 ║ 2013 ║ 0000002 ║
║ 0000002|2014 ║ 2014|0000002 ║ 2014 ║ 0000002 ║
╚══════════════╩══════════════╩═════════╩═════════╝
Question One: would either Composite_1 or Composite_2 give me better performance during the JOIN?
Question Two: would I ever need to index the Composite column, and if so should I do it alone/with others? The SSIS task will be using it for an in-memory JOIN, and I plan to include it as an ORDER BY within my OLE DB Source component.
Question Three: Does the clustered index belong on the Composite column, or the Year and ID columns? EDIT: Or, since I know that Year and ID won't affect the order of Composite, should I just include all three?