Search code examples
sqljoinhivehiveqlinformix

Informix to HiveQL


I have a SQL query written for Informix

SELECT cols
FROM table1 t1, outer(table2 t2, table3 t3)
WHERE t1.id = t2.id and t2.type = t3.type

We have the same tables and data on Hive and I want to translate it to HiveQL


Solution

  • Informix-style outer joins are quite interesting, and unique, and totally non-standard. Informix's implementation of ANSI (ISO) standard SQL outer joins should be totally boring and the same as other DBMS.

    Here is an SQL script that creates and populates 3 tables and executes 5 different queries against those tables.

    CREATE TABLE table1
    (
        id      SERIAL NOT NULL PRIMARY KEY,
        data    VARCHAR(32) NOT NULL
    );
    
    CREATE TABLE table2
    (
        id      INTEGER NOT NULL,
        type    INTEGER NOT NULL,
        info    VARCHAR(32) NOT NULL,
        PRIMARY KEY(id, type)
    );
    
    CREATE TABLE table3
    (
        type    INTEGER NOT NULL PRIMARY KEY,
        name    VARCHAR(32) NOT NULL
    );
    
    INSERT INTO table1 VALUES(100, 'Table 1 - ID 100');
    INSERT INTO table1 VALUES(101, 'Table 1 - ID 101');
    INSERT INTO table1 VALUES(102, 'Table 1 - ID 102');
    INSERT INTO table1 VALUES(103, 'Table 1 - ID 103');
    INSERT INTO table1 VALUES(104, 'Table 1 - ID 104');
    
    INSERT INTO table2 VALUES(100, 300, 'Table 2 - ID 100, Type 300');
    INSERT INTO table2 VALUES(100, 301, 'Table 2 - ID 100, Type 301');
    INSERT INTO table2 VALUES(100, 302, 'Table 2 - ID 100, Type 302');
    INSERT INTO table2 VALUES(101, 301, 'Table 2 - ID 101, Type 301');
    INSERT INTO table2 VALUES(101, 400, 'Table 2 - ID 101, Type 400');
    INSERT INTO table2 VALUES(101, 302, 'Table 2 - ID 101, Type 302');
    INSERT INTO table2 VALUES(103, 302, 'Table 2 - ID 103, Type 302');
    INSERT INTO table2 VALUES(103, 303, 'Table 2 - ID 103, Type 303');
    INSERT INTO table2 VALUES(103, 300, 'Table 2 - ID 103, Type 300');
    INSERT INTO table2 VALUES(107, 300, 'Table 2 - ID 107, Type 300');
    INSERT INTO table2 VALUES(107, 400, 'Table 2 - ID 107, Type 400');
    
    INSERT INTO table3 VALUES(300, 'Table 3 - Type 300');
    INSERT INTO table3 VALUES(301, 'Table 3 - Type 301');
    INSERT INTO table3 VALUES(302, 'Table 3 - Type 302');
    INSERT INTO table3 VALUES(303, 'Table 3 - Type 303');
    INSERT INTO table3 VALUES(304, 'Table 3 - Type 304');
    INSERT INTO table3 VALUES(305, 'Table 3 - Type 305');
    
    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1, OUTER(table2 t2, table3 t3)
     WHERE t1.id = t2.id AND t2.type = t3.type;
    
    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1, OUTER(table2 t2, OUTER table3 t3)
     WHERE t1.id = t2.id AND t2.type = t3.type;
    
    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1
      LEFT JOIN table2 t2 ON t1.id = t2.id
      LEFT JOIN table3 t3 ON t2.type = t3.type;
    
    SELECT t1.id as t1_id, t1.data, s.join_key as t2_id, s.t2_type, s.info, s.t3_type, s.name
      FROM table1 t1
      LEFT JOIN 
           (SELECT t2.id AS join_key, t2.info, t3.name, t2.type as t2_type, t3.type as t3_type
              FROM table2 t2
              JOIN table3 t3 ON t2.type = t3.type
           ) s ON s.join_key = t1.id;
    
    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1
      LEFT JOIN table2 t2 ON t1.id = t2.id
      JOIN table3 t3 ON t2.type = t3.type;
    

    The first query is the one from the question, with the selected column names specified. The list of column names is the same in every query.

    The second query is a variant, using an extra OUTER keyword. It corresponds to the third query, which was the first answer created by leftjoin. The third query is the first query produced by @leftjoin. The fourth query is the 'current answer' (strictly, the third revision) created by @leftjoin. The fifth query is a variation on the fourth — it produces a different answer, so it is not equivalent.

    These are the outputs of the queries.

    Query 1

    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1, OUTER(table2 t2, table3 t3)
     WHERE t1.id = t2.id AND t2.type = t3.type;
    
    t1_id data t2_id t2_type info t3_type name
    100 Table 1 - ID 100 100 300 Table 2 - ID 100, Type 300 300 Table 3 - Type 300
    100 Table 1 - ID 100 100 301 Table 2 - ID 100, Type 301 301 Table 3 - Type 301
    100 Table 1 - ID 100 100 302 Table 2 - ID 100, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 301 Table 2 - ID 101, Type 301 301 Table 3 - Type 301
    101 Table 1 - ID 101 101 302 Table 2 - ID 101, Type 302 302 Table 3 - Type 302
    102 Table 1 - ID 102
    103 Table 1 - ID 103 103 300 Table 2 - ID 103, Type 300 300 Table 3 - Type 300
    103 Table 1 - ID 103 103 302 Table 2 - ID 103, Type 302 302 Table 3 - Type 302
    103 Table 1 - ID 103 103 303 Table 2 - ID 103, Type 303 303 Table 3 - Type 303
    104 Table 1 - ID 104

    Query 2

    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1, OUTER(table2 t2, OUTER table3 t3)
     WHERE t1.id = t2.id AND t2.type = t3.type;
    
    t1_id data t2_id t2_type info t3_type name
    100 Table 1 - ID 100 100 300 Table 2 - ID 100, Type 300 300 Table 3 - Type 300
    100 Table 1 - ID 100 100 301 Table 2 - ID 100, Type 301 301 Table 3 - Type 301
    100 Table 1 - ID 100 100 302 Table 2 - ID 100, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 301 Table 2 - ID 101, Type 301 301 Table 3 - Type 301
    101 Table 1 - ID 101 101 302 Table 2 - ID 101, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 400 Table 2 - ID 101, Type 400
    102 Table 1 - ID 102
    103 Table 1 - ID 103 103 300 Table 2 - ID 103, Type 300 300 Table 3 - Type 300
    103 Table 1 - ID 103 103 302 Table 2 - ID 103, Type 302 302 Table 3 - Type 302
    103 Table 1 - ID 103 103 303 Table 2 - ID 103, Type 303 303 Table 3 - Type 303
    104 Table 1 - ID 104

    Query 3

    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1
      LEFT JOIN table2 t2 ON t1.id = t2.id
      LEFT JOIN table3 t3 ON t2.type = t3.type;
    
    t1_id data t2_id t2_type info t3_type name
    100 Table 1 - ID 100 100 300 Table 2 - ID 100, Type 300 300 Table 3 - Type 300
    100 Table 1 - ID 100 100 301 Table 2 - ID 100, Type 301 301 Table 3 - Type 301
    100 Table 1 - ID 100 100 302 Table 2 - ID 100, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 301 Table 2 - ID 101, Type 301 301 Table 3 - Type 301
    101 Table 1 - ID 101 101 302 Table 2 - ID 101, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 400 Table 2 - ID 101, Type 400
    102 Table 1 - ID 102
    103 Table 1 - ID 103 103 300 Table 2 - ID 103, Type 300 300 Table 3 - Type 300
    103 Table 1 - ID 103 103 302 Table 2 - ID 103, Type 302 302 Table 3 - Type 302
    103 Table 1 - ID 103 103 303 Table 2 - ID 103, Type 303 303 Table 3 - Type 303
    104 Table 1 - ID 104

    Query 4

    SELECT t1.id as t1_id, t1.data, s.join_key as t2_id, s.t2_type, s.info, s.t3_type, s.name
      FROM table1 t1
      LEFT JOIN 
           (SELECT t2.id AS join_key, t2.info, t3.name, t2.type as t2_type, t3.type as t3_type
              FROM table2 t2
              JOIN table3 t3 ON t2.type = t3.type
           ) s ON s.join_key = t1.id;
    
    t1_id data t2_id t2_type info t3_type name
    100 Table 1 - ID 100 100 300 Table 2 - ID 100, Type 300 300 Table 3 - Type 300
    100 Table 1 - ID 100 100 301 Table 2 - ID 100, Type 301 301 Table 3 - Type 301
    100 Table 1 - ID 100 100 302 Table 2 - ID 100, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 301 Table 2 - ID 101, Type 301 301 Table 3 - Type 301
    101 Table 1 - ID 101 101 302 Table 2 - ID 101, Type 302 302 Table 3 - Type 302
    102 Table 1 - ID 102
    103 Table 1 - ID 103 103 300 Table 2 - ID 103, Type 300 300 Table 3 - Type 300
    103 Table 1 - ID 103 103 302 Table 2 - ID 103, Type 302 302 Table 3 - Type 302
    103 Table 1 - ID 103 103 303 Table 2 - ID 103, Type 303 303 Table 3 - Type 303
    104 Table 1 - ID 104

    Query 5

    SELECT t1.id as t1_id, t1.data, t2.id as t2_id, t2.type as t2_type, t2.info, t3.type as t3_type, t3.name
      FROM table1 t1
      LEFT JOIN table2 t2 ON t1.id = t2.id
      JOIN table3 t3 ON t2.type = t3.type;
    
    t1_id data t2_id t2_type info t3_type name
    100 Table 1 - ID 100 100 300 Table 2 - ID 100, Type 300 300 Table 3 - Type 300
    103 Table 1 - ID 103 103 300 Table 2 - ID 103, Type 300 300 Table 3 - Type 300
    100 Table 1 - ID 100 100 301 Table 2 - ID 100, Type 301 301 Table 3 - Type 301
    101 Table 1 - ID 101 101 301 Table 2 - ID 101, Type 301 301 Table 3 - Type 301
    100 Table 1 - ID 100 100 302 Table 2 - ID 100, Type 302 302 Table 3 - Type 302
    101 Table 1 - ID 101 101 302 Table 2 - ID 101, Type 302 302 Table 3 - Type 302
    103 Table 1 - ID 103 103 302 Table 2 - ID 103, Type 302 302 Table 3 - Type 302
    103 Table 1 - ID 103 103 303 Table 2 - ID 103, Type 303 303 Table 3 - Type 303

    Unless I've missed something, the outputs from Q1 and Q4 are the same (so the answer by @leftjoin is correct); the outputs from the other queries are all different from each other (and from Q1 and Q4).