I have two tables, table Y, with composite Primary key OmgevingId, AdministratieKantoorID, WerkgeverID, DienstverbandID, VerloningsNummer
:
OmgevingID | KantoorID | WerkgeverID | DienstverbandID | VerloningsNummer | Period |
---|---|---|---|---|---|
28 | 1 | 19 | 243 | 1 | 2023-01-01 |
28 | 1 | 19 | 244 | 2 | 2023-02-01 |
28 | 1 | 19 | 245 | 3 | 2023-03-01 |
28 | 1 | 19 | 243 | 4 | 2023-01-01 |
28 | 1 | 19 | 244 | 5 | 2023-02-01 |
28 | 1 | 19 | 245 | 6 | 2023-03-01 |
and table X, with composite Primary Key OmgevingId, AdministratieKantoorID, WerkgeverID, DienstverbandID, WerktijdID
:
OmgevingID | KantoorID | WerkgeverID | DienstverbandID | WerktijdID | StartDate | EndDate | FieldINeed |
---|---|---|---|---|---|---|---|
28 | 1 | 19 | 243 | 1 | 2019-12-28 | 2021-02-02 | A |
28 | 1 | 19 | 243 | 2 | 2021-02-03 | 2023-01-31 | B |
28 | 1 | 19 | 243 | 3 | 2023-02-01 | 2023-03-15 | C |
28 | 1 | 19 | 244 | 1 | 2019-01-31 | NULL | D |
28 | 1 | 19 | 245 | 1 | 2018-12-28 | 2020-02-02 | E |
28 | 1 | 19 | 245 | 2 | 2020-02-03 | 2022-01-31 | F |
28 | 1 | 19 | 245 | 3 | 2022-02-01 | NULL | G |
I want to LEFT JOIN
table X into table Y. However, this leads to problems as there is a m:n
relationship, and a LEFT JOIN
will make the resulting table explode. As you can see, the tables have mutual keys: OmgevingID, AdministratieKantoorID, WerkgeverID and DienstverbandID
. If I could somehow make it so that these keys uniquely identify a row in table X, my problem would be solved as there would be a m:1
relationship between Y and X, and I could therefore safely join X into Y.
I could do this by for example using a GROUP BY
on the mutual keys in table X, and taking some aggregate of WerktijdID and the rows behind WerktijdID. I could then INNER JOIN
this table on table X, to get a table with only one 'werktijdID' per combination of OmgevingID, AdministratieKantoorID, WerkgeverID and DienstverbandID
, thus making the resulting Primary Key of this table OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID
. Then I could safely join X into Y. The following code snippet shows how such a query would look (For simplicity I only included keys in the query but similar logic could be applied to other data fields):
SELECT x.*, wt.* FROM
LEFT JOIN (
SELECT OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID, WerktijdID
FROM y
INNER JOIN (
SELECT OmgevingID, AdministratieKantoorID, WerkgeverID, PersoonID, DienstverbandID, MAX(WerktijdID) as werktijdID
FROM Y
GROUP BY OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID) maxwt
ON y.OmgevingID = maxwt.OmgevingID
AND y.AdministratieKantoorID = maxwt.AdministratieKantoorID
AND y.WerkgeverID = maxwt.WerkgeverID
AND y.DienstverbandID = maxwt.DienstverbandID
AND y.WerktijdID = maxwt.WerktijdID ) as wt
ON x.OmgevingID = wt.OmgevingID
AND x.AdministratieKantoorID = wt.AdministratieKantoorID
AND x.WerkgeverID = wt.WerkgeverID
AND x.DienstverbandID = wt.DienstverbandID
The results of such a query would look like this:
OmgevingID | KantoorID | WerkgeverID | DienstverbandID | VeloningsNummer | Period | WerktijdID | StartDate | EndDate | FieldINeed |
---|---|---|---|---|---|---|---|---|---|
28 | 1 | 19 | 243 | 1 | 2023-01-01 | 3 | 2023-02-01 | 2023-03-15 | C |
28 | 1 | 19 | 244 | 2 | 2023-01-01 | 1 | 2019-01-31 | NULL | D |
28 | 1 | 19 | 245 | 3 | 2023-03-01 | 3 | 2022-02-01 | NULL | G |
28 | 1 | 19 | 243 | 4 | 2023-01-01 | 3 | 2023-02-01 | 2023-03-15 | C |
28 | 1 | 19 | 244 | 5 | 2023-01-01 | 1 | 2019-01-31 | NULL | D |
28 | 1 | 19 | 245 | 6 | 2023-03-01 | 3 | 2022-02-01 | NULL | G |
But that would be incorrect, as I don't want to have the MAX(WerktijdID)
, instead I want the WerktijdID which is "active" in a certain period. Only one WerktijdID can be active at a given time in table X. To find which WerktijdID is "active" in a certain Period, I can use the following rule: StartDate <= Period AND (EndDate >= Period OR EndDate is NULL)
The (correct) expected result of the query would look as follows:
OmgevingID | KantoorID | WerkgeverID | DienstverbandID | VeloningsNummer | Period | WerktijdID | StartDate | EndDate | FieldINeed |
---|---|---|---|---|---|---|---|---|---|
28 | 1 | 19 | 243 | 1 | 2023-01-01 | 2 | 2021-02-03 | 2023-01-31 | B |
28 | 1 | 19 | 244 | 2 | 2023-01-01 | 1 | 2019-01-31 | NULL | D |
28 | 1 | 19 | 245 | 3 | 2023-03-01 | 3 | 2022-02-01 | NULL | G |
28 | 1 | 19 | 243 | 4 | 2023-01-01 | 2 | 2021-02-03 | 2023-01-31 | B |
28 | 1 | 19 | 244 | 5 | 2023-01-01 | 1 | 2019-01-31 | NULL | D |
28 | 1 | 19 | 245 | 6 | 2023-03-01 | 3 | 2022-02-01 | NULL | G |
However, when I try this in SQL like so:
SELECT x.*, y.* FROM y
LEFT JOIN (
SELECT * FROM x
WHERE StartDate <= y.Period AND (EndDate >= y.Period OR EndDate is NULL) as x_alias
ON y.OmgevingID = x_alias.OmgevingID
AND y.AdministratieKantoorID = x_alias.AdministratieKantoorID
AND y.WerkgeverID = x_alias.WerkgeverID
AND y.DienstverbandID = x_alias.DienstverbandID
I get the following error: "Error: Column 'x_alias.Period' does not exist"
. I think this is because I cannot refer to y.period inside the subquery? Any tips on how to achieve my desired result would be greatly appreciated. I am using Spark-SQL, so a solution in PySpark would also be acceptable.
This is because you can't refer to the outer table (y) in a subquery,
Try this :
SELECT x.*, y.*
FROM y
LEFT JOIN x
ON y.OmgevingID = x.OmgevingID
AND y.AdministratieKantoorID = x.AdministratieKantoorID
AND y.WerkgeverID = x.WerkgeverID
AND y.DienstverbandID = x.DienstverbandID
AND x.StartDate <= y.Period
AND (x.EndDate >= y.Period OR x.EndDate IS NULL)