Search code examples
sqlhadoopjoinhiveapache-hive

Hive / SQL - Left join with fallback


In Apache Hive I have to tables I would like to left-join keeping all the data from the left data and adding data where possible from the right table. For this I use two joins, because the join is based on two fields (a material_id and a location_id). This works fine with two traditional left joins:

SELECT 
   a.*, 
   b.*
FROM a
INNER JOIN (some more complex select) b
   ON a.material_id=b.material_id 
   AND a.location_id=b.location_id;

For the location_id the database only contains two distinct values, say 1 and 2.

We now have the requirement that if there is no "perfect match", this means that only the material_id can be joined and there is no correct combination of material_id and location_id (e.g. material_id=100 and location_id=1) for the join for the location_id in the b-table, the join should "default" or "fallback" to the other possible value of the location_id e.g. material_id=001 and location_id=2 and vice versa. This should only be the case for the location_id.

We have already looked into all possible answers also with CASE etc. but to no prevail. A setup like

...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;

we tried or did not figure out how really to do in hive query language.

Thank you for your help! Maybe somebody has a smart idea.

Here is some sample data:

Table a
| material_id | location_id | other_column_a |
| 100         | 1           | 45            |
| 101         | 1           | 45            |
| 103         | 1           | 45            |
| 103         | 2           | 45            |



Table b
| material_id | location_id | other_column_b |
| 100         | 1           | 66            |
| 102         | 1           | 76            |
| 103         | 2           | 88            |


Left - Join Table
| material_id | location_id | other_column_a | other_column_b
| 100         | 1           | 45            | 66
| 101         | 1           | 45            | NULL (mat. not in b)
| 103         | 1           | 45            | DEFAULT TO where location_id=2 (88)
| 103         | 2           | 45            | 88

PS: As stated here exists etc. does not work in the sub-query ON.


Solution

  • The solution is to left join without a.location_id = b.location_id and number all rows in order of preference. Then filter by row_number. In the code below the join will duplicate rows first because all matching material_id will be joined, then row_number() function will assign 1 to rows where a.location_id = b.location_id and 2 to rows where a.location_id <> b.location_id if exist also rows where a.location_id = b.location_id and 1 if there are not exist such. b.location_id added to the order by in the row_number() function so it will "prefer" rows with lower b.location_id in case there are no exact matching. I hope you have caught the idea.

    select * from 
    (
    SELECT 
       a.*, 
       b.*,
       row_number() over(partition by material_id 
                         order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn
    FROM a
    LEFT JOIN (some more complex select) b
       ON a.material_id=b.material_id 
    )s 
    where rn=1
    ;