I have an ETL process which takes values from an input table which is a key value table with each row having a field ID and turning it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId
is a foreign key from table Field
, Day
is a foreign key from table Days
. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues
188 times, one for each field. Each join equates StudentId
and Day
and takes a different FieldId
. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
AND StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easer to add new fields.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = sfv.Day) as NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = sfv.Day) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value)
.