Search code examples
sql-servert-sqlgroup-bymax

How to join two tables where you only want the record with the max date from one side of the join?


I am trying to write a query to return currently unfilled training requirements.

The 4 main tables involved are:

  • TrainingRequirements table: Stores a CourseId and UserId. The user and course required to be completed.
  • UserTraining table: Stores a CourseId, UserId, and when the training was completed.
  • User table: The user who took the training.
  • Course table: The course for the training taken. Stores how many months a training is valid for.

Here is what I have so far:

SELECT * 
FROM TrainingRequirements
    INNER JOIN Users on Users.Id = TrainingRequirements.UserId
    INNER JOIN Courses on Courses.Id = TrainingRequirements.CourseId
    LEFT OUTER JOIN UserTrainings on UserTrainings .CourseId = TrainReq.CourseId 
    AND UserTrainings .UserId = TrainingRequirements.UserId
WHERE 
    Users.Active = 1 AND 
    TrainingRequirements.Active = 1 AND 
    (UserTrainings.Id is null OR 
       --covers if they never took the training at all
    GETDATE() >= DATEADD(month, Courses.MonthsValid, UserTrainings.DateCompleted)) 
       --covers if they took the training but it has expired and must be retaken

This almost works. It returns me instances where they never took the training (so UserTraining.Id is null because no matching record exists on the right side of the left outer join), but it also returns to me a bunch of older completed trainings that are no longer applicable.

If a training is required annually, the user training would accumulate multiple records over the years. I only want to compare it to the most recent completion, that is the highest DateCompleted per user per training, so I assume I need to use some sort of MAX() or Group By or subquery in the join, but I can't figure it out.

Sample Data: TrainingRequirements

Id | UserId  | CourseId | active
1  | 1       |1          | 1
2  | 1       |2          | 1
3  | 1       |3          | 0 
4  | 1       |4          | 1

UserTrainings

Id | UserId | CourseId | DateCompleted
1  |1       |1         | 2022-10-23 12:20:11.526
2  |1       |2         | 2021-11-15 05:01:12.320
3  |1       |2         | 2020-12-11 11:11:40.320

Courses

Id | Name                  | MonthsValid
1  | 'Cool Training'       | 36
2  | 'Secret Training'     | 12
3  | 'Top Secret Training' | 6
4  | 'Unpopular Training'  | 12

In the above, Requirement 3 is not active so can be ignored, Requirement 1 has a completion on 10/23/2023 and is valid for 36 months, so it's not expired and can also be ignored. 2 and 4 should be returned. 4 was never completed, and 2 was completed but since the expiration date (date of previous completion + months valid)

Desired Results

TrainingRequirementId | UserId | CourseId | LastTaken | Expired
2                     | 1      | 2        | 11/15/2021| 11/15/2022
4                     | 1      | 4        | null      | null

Using my query returns an additional row because the UserTraining with ID 3 is also matched. But since that is older an older completion I want to exclude it.


Solution

  • So without actual data to run this against, I'm kind of fumbling in the dark (i.e. I can't verify this actually does what you want) but it sounds like you want to get users who need to take a course, or took it, but over n months ago.

    To do this, you need the first row for each combination of UserID and CourseID when ordered by DateCompleted. The function row_number is a window function which helps with this. It adds a series of numbers (i.e. 1, 2, 3, etc.) over each "partition" (e.g. in your case UserID, CourseId) ordered by whatever you choose (in this case, DateCompleted desc) and since you order it by DateCompleted desc the row with the value of 1 will always be the most recenty for that partition.

    You should be able to run something like this to see what that looks like:

    select
        RID = row_number() over (partition by UserID, CourseId order by DateCompleted desc),
        CourseId,
        UserID,
        DateCompleted
    from UserTrainings
    

    From there, it's back to what you were doing before, except instead of joining to the actual UserTrainings table, you join to the CTE you created with the row_number over it, and you add the additional join predicate where u.RID = 1 to only grab the rows out of the CTE where DateCompleted is the most recent.

    Here's what I came up with. Again, without actual data, you may need to futz with this a bit, but I think this should get you on the right track.

    ;with ut as
    (
        select
            RID = row_number() over (partition by UserID, CourseId order by DateCompleted desc),
            CourseId,
            UserID,
            DateCompleted
        from UserTrainings
    )
    select *
    from TrainingRequirements tr
    inner join users u
        on tr.UserID = u.UserID
    inner join courses c
        on tr.CourseId = c.CourseID
    left outer join ut u
        on r.CourseId = u.CourseId
            and r.UserID = u.UserId
            and u.RID = 1 -- This gets the most recent record for each user/course in UserTrainings, sorted by DateCompleted
    where u.CourseID is null -- they haven't taken it
        or ut.DateCompleted < dateadd(month, r.MonthsValid, getdate())