Search code examples
sqlpysparkleft-join

How can I join two tables with different keys based on date values?


I have two tables, table Y, with composite Primary key OmgevingId, AdministratieKantoorID, WerkgeverID, DienstverbandID, VerloningsNummer:

OmgevingID KantoorID WerkgeverID DienstverbandID VerloningsNummer Period
28 1 19 243 1 2023-01-01
28 1 19 244 2 2023-02-01
28 1 19 245 3 2023-03-01
28 1 19 243 4 2023-01-01
28 1 19 244 5 2023-02-01
28 1 19 245 6 2023-03-01

and table X, with composite Primary Key OmgevingId, AdministratieKantoorID, WerkgeverID, DienstverbandID, WerktijdID:

OmgevingID KantoorID WerkgeverID DienstverbandID WerktijdID StartDate EndDate FieldINeed
28 1 19 243 1 2019-12-28 2021-02-02 A
28 1 19 243 2 2021-02-03 2023-01-31 B
28 1 19 243 3 2023-02-01 2023-03-15 C
28 1 19 244 1 2019-01-31 NULL D
28 1 19 245 1 2018-12-28 2020-02-02 E
28 1 19 245 2 2020-02-03 2022-01-31 F
28 1 19 245 3 2022-02-01 NULL G

I want to LEFT JOIN table X into table Y. However, this leads to problems as there is a m:n relationship, and a LEFT JOIN will make the resulting table explode. As you can see, the tables have mutual keys: OmgevingID, AdministratieKantoorID, WerkgeverID and DienstverbandID. If I could somehow make it so that these keys uniquely identify a row in table X, my problem would be solved as there would be a m:1 relationship between Y and X, and I could therefore safely join X into Y.

I could do this by for example using a GROUP BY on the mutual keys in table X, and taking some aggregate of WerktijdID and the rows behind WerktijdID. I could then INNER JOIN this table on table X, to get a table with only one 'werktijdID' per combination of OmgevingID, AdministratieKantoorID, WerkgeverID and DienstverbandID, thus making the resulting Primary Key of this table OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID. Then I could safely join X into Y. The following code snippet shows how such a query would look (For simplicity I only included keys in the query but similar logic could be applied to other data fields):

SELECT x.*, wt.* FROM
LEFT JOIN (
    SELECT OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID, WerktijdID
    FROM y
    INNER JOIN (
           SELECT OmgevingID, AdministratieKantoorID, WerkgeverID, PersoonID, DienstverbandID, MAX(WerktijdID) as werktijdID
           FROM Y
           GROUP BY OmgevingID, AdministratieKantoorID, WerkgeverID, DienstverbandID) maxwt
    ON y.OmgevingID = maxwt.OmgevingID
        AND y.AdministratieKantoorID = maxwt.AdministratieKantoorID
        AND y.WerkgeverID = maxwt.WerkgeverID
        AND y.DienstverbandID = maxwt.DienstverbandID
        AND y.WerktijdID = maxwt.WerktijdID ) as wt
ON x.OmgevingID = wt.OmgevingID
AND x.AdministratieKantoorID = wt.AdministratieKantoorID
AND x.WerkgeverID = wt.WerkgeverID
AND x.DienstverbandID = wt.DienstverbandID

The results of such a query would look like this:

OmgevingID KantoorID WerkgeverID DienstverbandID VeloningsNummer Period WerktijdID StartDate EndDate FieldINeed
28 1 19 243 1 2023-01-01 3 2023-02-01 2023-03-15 C
28 1 19 244 2 2023-01-01 1 2019-01-31 NULL D
28 1 19 245 3 2023-03-01 3 2022-02-01 NULL G
28 1 19 243 4 2023-01-01 3 2023-02-01 2023-03-15 C
28 1 19 244 5 2023-01-01 1 2019-01-31 NULL D
28 1 19 245 6 2023-03-01 3 2022-02-01 NULL G

But that would be incorrect, as I don't want to have the MAX(WerktijdID), instead I want the WerktijdID which is "active" in a certain period. Only one WerktijdID can be active at a given time in table X. To find which WerktijdID is "active" in a certain Period, I can use the following rule: StartDate <= Period AND (EndDate >= Period OR EndDate is NULL)

The (correct) expected result of the query would look as follows:

OmgevingID KantoorID WerkgeverID DienstverbandID VeloningsNummer Period WerktijdID StartDate EndDate FieldINeed
28 1 19 243 1 2023-01-01 2 2021-02-03 2023-01-31 B
28 1 19 244 2 2023-01-01 1 2019-01-31 NULL D
28 1 19 245 3 2023-03-01 3 2022-02-01 NULL G
28 1 19 243 4 2023-01-01 2 2021-02-03 2023-01-31 B
28 1 19 244 5 2023-01-01 1 2019-01-31 NULL D
28 1 19 245 6 2023-03-01 3 2022-02-01 NULL G

However, when I try this in SQL like so:

SELECT x.*, y.* FROM y
LEFT JOIN (
    SELECT * FROM x
    WHERE StartDate <= y.Period AND (EndDate >= y.Period OR EndDate is NULL) as x_alias 
ON y.OmgevingID = x_alias.OmgevingID 
AND y.AdministratieKantoorID = x_alias.AdministratieKantoorID
AND y.WerkgeverID = x_alias.WerkgeverID 
AND y.DienstverbandID = x_alias.DienstverbandID

I get the following error: "Error: Column 'x_alias.Period' does not exist". I think this is because I cannot refer to y.period inside the subquery? Any tips on how to achieve my desired result would be greatly appreciated. I am using Spark-SQL, so a solution in PySpark would also be acceptable.


Solution

  • This is because you can't refer to the outer table (y) in a subquery,

    Try this :

    SELECT x.*, y.* 
    FROM y
    LEFT JOIN x
    ON y.OmgevingID = x.OmgevingID 
    AND y.AdministratieKantoorID = x.AdministratieKantoorID
    AND y.WerkgeverID = x.WerkgeverID 
    AND y.DienstverbandID = x.DienstverbandID
    AND x.StartDate <= y.Period 
    AND (x.EndDate >= y.Period OR x.EndDate IS NULL)