How can I join two tables with different keys based on date values?

I have two tables, table Y, with composite Primary key OmgevingId, AdministratieKantoorID, WerkgeverID, DienstverbandID, VerloningsNummer:

OmgevingID	KantoorID	WerkgeverID	DienstverbandID	VerloningsNummer	Period
28	1	19	243	1	2023-01-01
28	1	19	244	2	2023-02-01
28	1	19	245	3	2023-03-01
28	1	19	243	4	2023-01-01
28	1	19	244	5	2023-02-01
28	1	19	245	6	2023-03-01

and table X, with composite Primary Key OmgevingId, AdministratieKantoorID, WerkgeverID, DienstverbandID, WerktijdID:

OmgevingID	KantoorID	WerkgeverID	DienstverbandID	WerktijdID	StartDate	EndDate	FieldINeed
28	1	19	243	1	2019-12-28	2021-02-02	A
28	1	19	243	2	2021-02-03	2023-01-31	B
28	1	19	243	3	2023-02-01	2023-03-15	C
28	1	19	244	1	2019-01-31	NULL	D
28	1	19	245	1	2018-12-28	2020-02-02	E
28	1	19	245	2	2020-02-03	2022-01-31	F
28	1	19	245	3	2022-02-01	NULL	G

I want to LEFT JOIN table X into table Y. However, this leads to problems as there is a m:n relationship, and a LEFT JOIN will make the resulting table explode. As you can see, the tables have mutual keys: OmgevingID, AdministratieKantoorID, WerkgeverID and DienstverbandID. If I could somehow make it so that these keys uniquely identify a row in table X, my problem would be solved as there would be a m:1 relationship between Y and X, and I could therefore safely join X into Y.

I could do this by for example using a GROUP BY on the mutual keys in table X, and taking some aggregate of WerktijdID and the rows behind WerktijdID. I could then INNER JOIN this table on table X, to get a table with only one 'werktijdID' per combination of OmgevingID, AdministratieKantoorID, WerkgeverID and DienstverbandID, thus making the resulting Primary Key of this table OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID. Then I could safely join X into Y. The following code snippet shows how such a query would look (For simplicity I only included keys in the query but similar logic could be applied to other data fields):

SELECT x.*, wt.* FROM
LEFT JOIN (
    SELECT OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID, WerktijdID
    FROM y
    INNER JOIN (
           SELECT OmgevingID, AdministratieKantoorID, WerkgeverID, PersoonID, DienstverbandID, MAX(WerktijdID) as werktijdID
           FROM Y
           GROUP BY OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID) maxwt
    ON y.OmgevingID = maxwt.OmgevingID
        AND y.AdministratieKantoorID = maxwt.AdministratieKantoorID
        AND y.WerkgeverID = maxwt.WerkgeverID
        AND y.DienstverbandID = maxwt.DienstverbandID
        AND y.WerktijdID = maxwt.WerktijdID ) as wt
ON x.OmgevingID = wt.OmgevingID
AND x.AdministratieKantoorID = wt.AdministratieKantoorID
AND x.WerkgeverID = wt.WerkgeverID
AND x.DienstverbandID = wt.DienstverbandID

The results of such a query would look like this:

OmgevingID	KantoorID	WerkgeverID	DienstverbandID	VeloningsNummer	Period	WerktijdID	StartDate	EndDate	FieldINeed
28	1	19	243	1	2023-01-01	3	2023-02-01	2023-03-15	C
28	1	19	244	2	2023-01-01	1	2019-01-31	NULL	D
28	1	19	245	3	2023-03-01	3	2022-02-01	NULL	G
28	1	19	243	4	2023-01-01	3	2023-02-01	2023-03-15	C
28	1	19	244	5	2023-01-01	1	2019-01-31	NULL	D
28	1	19	245	6	2023-03-01	3	2022-02-01	NULL	G

But that would be incorrect, as I don't want to have the MAX(WerktijdID), instead I want the WerktijdID which is "active" in a certain period. Only one WerktijdID can be active at a given time in table X. To find which WerktijdID is "active" in a certain Period, I can use the following rule: StartDate <= Period AND (EndDate >= Period OR EndDate is NULL)

The (correct) expected result of the query would look as follows:

OmgevingID	KantoorID	WerkgeverID	DienstverbandID	VeloningsNummer	Period	WerktijdID	StartDate	EndDate	FieldINeed
28	1	19	243	1	2023-01-01	2	2021-02-03	2023-01-31	B
28	1	19	244	2	2023-01-01	1	2019-01-31	NULL	D
28	1	19	245	3	2023-03-01	3	2022-02-01	NULL	G
28	1	19	243	4	2023-01-01	2	2021-02-03	2023-01-31	B
28	1	19	244	5	2023-01-01	1	2019-01-31	NULL	D
28	1	19	245	6	2023-03-01	3	2022-02-01	NULL	G

However, when I try this in SQL like so:

SELECT x.*, y.* FROM y
LEFT JOIN (
    SELECT * FROM x
    WHERE StartDate <= y.Period AND (EndDate >= y.Period OR EndDate is NULL) as x_alias 
ON y.OmgevingID = x_alias.OmgevingID 
AND y.AdministratieKantoorID = x_alias.AdministratieKantoorID
AND y.WerkgeverID = x_alias.WerkgeverID 
AND y.DienstverbandID = x_alias.DienstverbandID

I get the following error: "Error: Column 'x_alias.Period' does not exist". I think this is because I cannot refer to y.period inside the subquery? Any tips on how to achieve my desired result would be greatly appreciated. I am using Spark-SQL, so a solution in PySpark would also be acceptable.

Solution

This is because you can't refer to the outer table (y) in a subquery,

Try this :

SELECT x.*, y.* 
FROM y
LEFT JOIN x
ON y.OmgevingID = x.OmgevingID 
AND y.AdministratieKantoorID = x.AdministratieKantoorID
AND y.WerkgeverID = x.WerkgeverID 
AND y.DienstverbandID = x.DienstverbandID
AND x.StartDate <= y.Period 
AND (x.EndDate >= y.Period OR x.EndDate IS NULL)