Search code examples
hivehiveqlcross-join

Cross join remaining combinations


I am trying to build a table that would bring be a combination of all products that I could sell, based on the current ones.

Product Status Table

+-------------+--------------+----------------+
| customer_id | product_name | product_status |
+-------------+--------------+----------------+
|           1 | A            | Active         |
|           2 | B            | Active         |
|           2 | C            | Active         |
|           3 | A            | Cancelled      |
+-------------+--------------+----------------+

Now I am trying to cross join with a hard code table that would give be 4 rows per customer_id, based on all 4 product we have in our portfolio, and statuses that I would like to apply.

Portfolio Table


+--------------+------------+----------+
| product_name |  status_1  | status_2 |
+--------------+------------+----------+
| A            | Inelegible | Inactive |
| B            | Inelegible | Inactive |
| C            | Ineligible | Inactive |
| D            | Inelegible | Inactive |
+--------------+------------+----------+


On my code I tried to use a CROSS JOIN in order to achieve 4 rows per customer_id. Unfortunately, for customers that have more than one product, I have double/triple rows.

This is my code:

SELECT
    p.customer_id,
    CASE WHEN p.product_name = pt.product_name THEN p.product_name ELSE pt.product_name END AS product_name,
    CASE 
        WHEN p.product_name = pt.product_name THEN p.product_status 
        ELSE pt.status_1
    END AS product_status
FROM 
    products AS p
CROSS JOIN
    portfolio as pt

This is my current output:


+----+-------------+--------------+----------------+
| #  | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
|  1 |           1 | A            | Active         |
|  2 |           1 | B            | Inelegible     |
|  3 |           1 | C            | Inelegible     |
|  4 |           1 | D            | Inelegible     |
|  5 |           2 | A            | Ineligible     |
|  6 |           2 | A            | Ineligible     |
|  7 |           2 | B            | Active         |
|  8 |           2 | B            | Ineligible     |
|  9 |           2 | C            | Active         |
| 10 |           2 | C            | Ineligible     |
| 11 |           2 | D            | Ineligible     |
| 12 |           2 | D            | Ineligible     |
| 13 |           3 | A            | Cancelled      |
| 14 |           3 | B            | Ineligible     |
| 15 |           3 | C            | Ineligible     |
| 16 |           3 | D            | Ineligible     |
+----+-------------+--------------+----------------+

As you may see, for the customer_id 2, I have two rows for each product having products B and C with different statuses then what I have on the product_status table.

What I would like to achieve, in this case, is a table with 12 rows, in which the current product/status from the product_status table is shown, and the remaining product/statuses from the portfolio table are added.

Expected output


+----+-------------+--------------+----------------+
| #  | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
|  1 |           1 | A            | Active         |
|  2 |           1 | B            | Inelegible     |
|  3 |           1 | C            | Inelegible     |
|  4 |           1 | D            | Inelegible     |
|  5 |           2 | A            | Ineligible     |
|  6 |           2 | B            | Active         |
|  7 |           2 | C            | Active         |
|  8 |           2 | D            | Ineligible     |
|  9 |           3 | A            | Cancelled      |
| 10 |           3 | B            | Ineligible     |
| 11 |           3 | C            | Ineligible     |
| 12 |           3 | D            | Ineligible     |
+----+-------------+--------------+----------------+

Not sure if the CROSS JOIN is the best alternative, but now I am running out of ideas.


Solution

  • EDIT:

    I thought of another cleaner solution. Do a cross join first, then a right join on the customer_id and product_name, and coalesce the product statuses.

    SELECT customer_id, product_name, coalesce(product_status, status_1)
    FROM products p
    RIGHT JOIN (
        SELECT * 
        FROM (SELECT DISTINCT customer_id FROM products) pro
        CROSS JOIN portfolio
    ) pt
    USING (customer_id, product_name)
    ORDER BY customer_id, product_name
    

    Old answer: The idea is to include information of all product names for a customer_id into a list, and check whether the product in portfolio is in that list.

    (SELECT customer_id, pt_product_name as product_name, first(status_1) as product_status
    FROM (
        SELECT
            customer_id,
            p.product_name as p_product_name,
            pt.product_name as pt_product_name,
            product_status,
            status_1,
            status_2,
            collect_list(p.product_name) over (partition by customer_id) AS product_list
        FROM products p
        CROSS JOIN portfolio pt
        )
    WHERE NOT array_contains(product_list, pt_product_name)
    GROUP BY customer_id, product_name)
    
    UNION ALL
    
    (SELECT customer_id, p_product_name as product_name, first(product_status) as product_status
    FROM (
        SELECT
            customer_id,
            p.product_name as p_product_name,
            pt.product_name as pt_product_name,
            product_status,
            status_1,
            status_2,
            collect_list(p.product_name) over (partition by customer_id) AS product_list 
        FROM products p
        CROSS JOIN portfolio pt)
    WHERE array_contains(product_list, pt_product_name)
    GROUP BY customer_id, product_name)
    
    ORDER BY customer_id, product_name;
    

    which gives

    +-----------+------------+--------------+
    |customer_id|product_name|product_status|
    +-----------+------------+--------------+
    |          1|           A|        Active|
    |          1|           B|    Inelegible|
    |          1|           C|    Ineligible|
    |          1|           D|    Inelegible|
    |          2|           A|    Inelegible|
    |          2|           B|        Active|
    |          2|           C|        Active|
    |          2|           D|    Inelegible|
    |          3|           A|     Cancelled|
    |          3|           B|    Inelegible|
    |          3|           C|    Ineligible|
    |          3|           D|    Inelegible|
    +-----------+------------+--------------+
    

    FYI the chunk before UNION ALL gives:

    +-----------+------------+--------------+
    |customer_id|product_name|product_status|
    +-----------+------------+--------------+
    |          1|           B|    Inelegible|
    |          1|           C|    Ineligible|
    |          1|           D|    Inelegible|
    |          2|           A|    Inelegible|
    |          2|           D|    Inelegible|
    |          3|           B|    Inelegible|
    |          3|           C|    Ineligible|
    |          3|           D|    Inelegible|
    +-----------+------------+--------------+
    

    And the chunk after UNION ALL gives:

    +-----------+------------+--------------+
    |customer_id|product_name|product_status|
    +-----------+------------+--------------+
    |          1|           A|        Active|
    |          2|           B|        Active|
    |          2|           C|        Active|
    |          3|           A|     Cancelled|
    +-----------+------------+--------------+
    

    Hope that helps!