Search code examples
sql-serverperformancessisindexingscd

Performance of concatenated column - does order matter?


I need to create a concatenated column based on two other columns in the table. One column is the year (10 distinct values), and one is a person's ID value (~150,000 distinct values). This is being used as a business key by an ETL task that will not accept multiple columns as the key value, so I need to persist this value in my database.

That said, I can choose how to create that value, and I'm wondering if the order (ID + Year or Year + ID) affects performance in any way. If the year goes first, the first four characters will always be one of a limited set of actual years. If the ID goes first, that will change for each user. Is there any difference between the two?

The only usage this column will get is during the ETL load, where it will be used to join data from the source and staging tables to check for differences between the two. The base values will be in the table underneath, and I plan on creating a clustered index on those base values.

Sample Data:

╔══════════════╦══════════════╦═════════╦═════════╗
║ COMPOSITE_1  ║ COMPOSITE_2  ║ AC_YEAR ║  ST_ID  ║
╠══════════════╬══════════════╬═════════╬═════════╣
║ 0000001|2005 ║ 2005|0000001 ║    2005 ║ 0000001 ║
║ 0000001|2006 ║ 2006|0000001 ║    2006 ║ 0000001 ║
║ 0000001|2009 ║ 2009|0000001 ║    2009 ║ 0000001 ║
║ 0000001|2010 ║ 2010|0000001 ║    2010 ║ 0000001 ║
║ 0000001|2012 ║ 2012|0000001 ║    2012 ║ 0000001 ║
║ 0000001|2013 ║ 2013|0000001 ║    2013 ║ 0000001 ║
║ 0000002|2005 ║ 2005|0000002 ║    2005 ║ 0000002 ║
║ 0000002|2006 ║ 2006|0000002 ║    2006 ║ 0000002 ║
║ 0000002|2007 ║ 2007|0000002 ║    2007 ║ 0000002 ║
║ 0000002|2008 ║ 2008|0000002 ║    2008 ║ 0000002 ║
║ 0000002|2009 ║ 2009|0000002 ║    2009 ║ 0000002 ║
║ 0000002|2010 ║ 2010|0000002 ║    2010 ║ 0000002 ║
║ 0000002|2012 ║ 2012|0000002 ║    2012 ║ 0000002 ║
║ 0000002|2013 ║ 2013|0000002 ║    2013 ║ 0000002 ║
║ 0000002|2014 ║ 2014|0000002 ║    2014 ║ 0000002 ║
╚══════════════╩══════════════╩═════════╩═════════╝ 

Question One: would either Composite_1 or Composite_2 give me better performance during the JOIN?

Question Two: would I ever need to index the Composite column, and if so should I do it alone/with others? The SSIS task will be using it for an in-memory JOIN, and I plan to include it as an ORDER BY within my OLE DB Source component.

Question Three: Does the clustered index belong on the Composite column, or the Year and ID columns? EDIT: Or, since I know that Year and ID won't affect the order of Composite, should I just include all three?


Solution

    1. Since you're turning these into a string column, the index is going to order them based on the characters in the string. You'll want the field (year or id) with the best distribution and most unique values to be first.
    2. You should index the composite column since it's being used for the join. Whether you want to include other data depends on what data you are pulling back.
    3. If you don't need the clustered index on any other columns, and the table is only being used for BI/data warehouse loading, you may as well put it on the composite column. That way a key lookup won't be required to get any other data you may need.