Search code examples
sqlhadoophivequery-optimizationhiveql

Optimizing Hive Query with UNION ALL and RANK with order


Current Scenario : I have this query which does a union all on two sets of data and then selects fields based on the rank But as per my analysis, the entire data set can be removed on one side of UNION

Analysis: So if you look at the below query - I think, we can completely ignore and remove the data set that is generated by joins of tables: P,Q,R,S and T

also can I replace unionall with union here

Query:

SELECT OUTERV.f1, ... OUTERV.f30
FROM 
      (
        SELECT 
          unionV.f1, ...unionV.f30, ROW_NUMBER() over (PARTITION BY unionV.ifc order by  unionV.orderNUM_ asc) rank_
        FROM 
          (
            SELECT f1 .. few fields, 1 as ORDERNUM_ 
            FROM 
            A 
            JOIN B on A.id = B.id 
            JOIN ( SELECT few remaining fields FROM C )  
            C ON C.id = B.id
            JOIN D ON C.id = D.id
            JOIN E ON E.id = D.id
            JOIN F on F.id = E.id
            UNION ALL 
            SELECT 
              f1, f2, ...f30 , 2 as ORDERNUM_ 
            FROM 
            P 
            JOIN Q ON P.id = Q.id
            JOIN R ON Q.id = R.id
            JOIN S on S.id = R.id
            JOIN T on S.id = T.id

          )unionV
      ) 
OUTERV where 
OUTERV.rank_ = 1

Request: Please confirm if my analysis is correct.


Solution

  • I disagree with the Analysis; it makes assumptions which may not be true. However if you can guarantee that all IFC values in the second part of the union exist in the first part of the union, and it's ALWAYS that way, then your analysis is correct.

    Essentially what the query you have does is trust the data from the first set of the union more than the second set of the union. However, if there is an IFC value in the second set not in the first; it must come from the 2nd part of the union; thus removing the second part of the union could remove records.

    Example:

    • Assume unionV.ifc is sourced from tables A and P on each side of the union
    • Assume the following data in A & P

    .

    A.ifc
    A
    B
    
    P.ifc
    A
    Z
    

    In your present query the results would be

    A (from A table)
    B (from A table)
    Z (from P Table)  
    

    If you eliminate the 2nd part of the union you eliminate P and therefore Z would be excluded from the results; hence they are not equal and you can't remove the 2nd part of the union.

    Now, if all ifc's defined in the second set are contained in the first set defined by the unions, and that is ALWAYS true; then yes you could eliminate the 2nd part of the union. As the first set contains the complete set in the first place. However, if that's not a guaranteed true statement, then the current approach using the union on a...F and P...T generates the "Master set"