
SQL Server index design where PK would be same size as row data


I am trying to confirm that my table needs a primary key, even though it would double the row size, or figure out what an appropriate indexing strategy would be. We are using SQL Server 2008 R2.

I have a Testscores table with just over 2 billion rows, and each row only contains 10 bytes of data of the following form:

(ItemID INT, ProjectID SMALLINT, DepartmentID SMALLINT, Score REAL).

No column is unique, but we have approximately 100 million ItemIDs, 500 ProjectIDs and 300 DepartmentIDs.

I have a lookup table of Projects with ~500 rows in the following form:

(ID SMALLINT, ProjectName VARCHAR, State CHAR(2), Year INT)

Originally this table was denormalized and approximately 600 GB. My goal is to be able to query the Projects table on ProjectName, State, or Year (sometimes one of those, sometimes two, sometimes all three). I would then join the Testscores table on ProjectID to return all test scores from matching projects (somewhere between 5 million and 20 million results).
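For concreteness, the query shape I have in mind is something like this (the filter values are made up, and the predicates vary from query to query):

```sql
-- Filter Projects on some combination of ProjectName / State / Year,
-- then join to Testscores on ProjectID to pull the matching scores.
SELECT ts.ItemID, ts.DepartmentID, ts.Score
FROM Projects AS p
JOIN Testscores AS ts
    ON ts.ProjectID = p.ID
WHERE p.State = 'CA'      -- sometimes one predicate...
  AND p.Year = 2010;      -- ...sometimes two or all three
```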

After rebuilding the tables (stupid of me; I should have figured this out first), I came to learn that without a clustered index, every query will have to use a table scan, even if I build a nonclustered index on ProjectID.

My current row size is 10 bytes, and adding a BIGINT key (needed, since I'm already at 2 billion rows and adding more) would add 8 bytes to each row, essentially doubling my database. Building a nonclustered index on ProjectID would essentially require 8 bytes of overhead for the uniqueifier (4 for the value itself, plus 4 because it's the first variable-length column in the row).
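For reference, the surrogate-key option I'm weighing looks roughly like this (names are illustrative; I haven't built this yet):

```sql
-- One option: a BIGINT surrogate key as the clustered primary key
-- (adds 8 bytes to every row), plus a nonclustered index on ProjectID.
CREATE TABLE Testscores (
    TestscoreID  BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    ItemID       INT      NOT NULL,
    ProjectID    SMALLINT NOT NULL,
    DepartmentID SMALLINT NOT NULL,
    Score        REAL     NOT NULL
);

CREATE NONCLUSTERED INDEX IX_Testscores_ProjectID
    ON Testscores (ProjectID);
```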

Any ideas? Did I screw something up in my database design? I don't mind rebuilding it again, I just want to do it right.

PS, I've haunted this site for about a decade, and this is the first question I've had that I couldn't answer through searches. You all rock!

Edit: When I loaded the data into the table, it was presorted on ProjectID ASC, ItemID ASC, if that makes any difference.
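In case it's useful, a clustered index matching that load order would look something like this (just a sketch, not something I've committed to):

```sql
-- A non-unique clustered index matching the order the rows were loaded in.
-- Since (ProjectID, ItemID) may not be unique, SQL Server would add its
-- 4-byte uniqueifier, but only to rows that duplicate an existing key.
CREATE CLUSTERED INDEX CIX_Testscores_ProjectID_ItemID
    ON Testscores (ProjectID ASC, ItemID ASC);
```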


Solution

  • With only about 10 bytes of data per row, SQL Server is putting on the order of 1,000 rows on each 8 KB page. That means that any query selecting more than about 0.1% of the data is quite likely to hit all or almost all pages. Under these circumstances, the engine generally opts for a full table scan rather than using an index.

    Given that your queries return at least 5 million rows, I suspect it would be hard to avoid a full table scan. A clustered index might help for some queries (with some luck), but not for all.

    One thing that might help is partitioning the table; however, you would need to denormalize the data for effective partitioning.
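    A minimal sketch of what that partitioning might look like, assuming the table is denormalized so that a filtering column such as Year is carried in Testscores (the function/scheme names and boundary values here are made up):

```sql
-- Hypothetical range partitioning on a denormalized Year column.
CREATE PARTITION FUNCTION pfYear (INT)
    AS RANGE RIGHT FOR VALUES (2006, 2007, 2008, 2009, 2010);

CREATE PARTITION SCHEME psYear
    AS PARTITION pfYear ALL TO ([PRIMARY]);

-- Rebuild Testscores on the partition scheme. Year must exist in the
-- table for this to work -- that is the denormalization mentioned above.
CREATE CLUSTERED INDEX CIX_Testscores_Year
    ON Testscores (Year, ProjectID)
    ON psYear (Year);
```

    With a setup along these lines, queries filtering on Year could benefit from partition elimination instead of scanning the whole table.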