Search code examples
sqlsql-serverjoinetlolap

Creating a denormalized table from a normalized key-value table using 100s of joins


I have an ETL process which takes values from an input table which is a key value table with each row having a field ID and turning it into a more denormalized table where each row has all the values. Specifically, this is the input table:

StudentFieldValues (
    FieldId INT NOT NULL,
    StudentId INT NOT NULL,
    Day DATE NOT NULL,
    Value FLOAT NULL
)

FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:

StudentDays (
    StudentId INT NOT NULL,
    Day DATE NOT NULL,
    NumberOfClasses FLOAT NULL,
    MinutesLateToSchool FLOAT NULL,
    ... -- the rest of the 188 fields
)

The PK is the first 2 fields.

Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:

SELECT Students.StudentId, Days.Day, 
       StudentFieldValues1.Value NumberOfClasses, 
       StudentFieldValues2.Value MinutesLateToSchool,
       ...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1 
ON Students.StudentId=StudentFieldValues1.StudentId AND 
   Days.Day=StudentFieldValues1.Day AND
   AND StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2 
ON Students.StudentId=StudentFieldValues2.StudentId AND 
   Days.Day=StudentFieldValues2.Day AND 
   StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId

I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.

More details:

  • MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
  • 352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
  • Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.

Solution

  • Think about doing this using conditional aggregation:

    SELECT s.StudentId, d.Day, 
           max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses, 
           max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
           ...
    INTO StudentDays
    FROM Students s CROSS JOIN
         Days d LEFT OUTER JOIN
         StudentFieldValues sfv 
         ON s.StudentId = sfv.StudentId AND 
            d.Day = sfv.Day 
    GROUP BY s.StudentId, d.Day;
    

    This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easer to add new fields.

    EDIT:

    A faster version of this query would use subqueries instead of aggregation:

    SELECT s.StudentId, d.Day, 
           (SELECT TOP 1 sfv.Value FROM StudentFieldValues WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = sfv.Day) as NumberOfClasses, 
            (SELECT TOP 1 sfv.Value FROM StudentFieldValues WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = sfv.Day) as MinutesLateToSchool,
           ...
    INTO StudentDays
    FROM Students s CROSS JOIN
         Days d;
    

    For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value).