Search code examples
joinhiveleft-joinbetweenimpala

Reduction in the number of records using range join


Following the my question I have the following tables the first (Range) includes range of values and additional columns:

row  | From   |  To     | Country ....
-----|--------|---------|---------
1    | 1200   |   1500  |
2    | 2200   |   2700  |
3    | 1700   |   1900  |
4    | 2100   |   2150  |
... 

The From and To are bigint and are exclusive. The Range table includes 1.8M records. Additional table (Values) contains 2.7M records and looks like:

 row     | Value  | More columns....
 --------|--------|----------------
    1    | 1777   |    
    2    | 2122   |    
    3    | 1832   |    
    4    | 1340   |    
    ... 

I would like to create one table as followed:

row      | Value  | From   | To    | More columns....
 --------|--------|--------|-------|---
    1    | 1777   | 1700   | 1900  |
    2    | 2122   | 2100   | 2150  |   
    3    | 1832   | 1700   | 1900  |   
    4    | 1340   | 1200   | 1500  |   
    ... 

I used the left outer join in the following code:

set n=1000;

select      v.id
           ,v.val
           ,r.from_val
           ,r.to_val

from      val v
        left outer join    

 (select  r.*
                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

            from    val_range r
                    lateral view    posexplode
                                    (
                                        split
                                        (
                                            space
                                            (
                                                cast
                                                (
                                                    floor(to_val/${hiveconf:n}) 
                                                  - floor(from_val/${hiveconf:n}) 

                                                    as int
                                                )
                                            )
                                           ,' '
                                        )
                                    ) pe as i,x
            ) r



            on      floor(v.val/${hiveconf:n})    =
                    r.match_val

where       v.val between r.from_val and r.to_val

order by    v.id       
;

However there is a vast reduction in the number of records of the new table ~31k records out of 2.7M. How can it be if I use the left outer join? How can I fix it?


Solution

  • Assuming we have a v.id

    set n=1000;
    
    select      v.id
               ,r.from_val
               ,r.to_val
    
    from                    val     v 
    
                left join  (select      v.id
                                       ,r.from_val
                                       ,r.to_val
    
                            from                val     v 
    
                                        join    (...)   r 
    
                                        on      floor(v.val/${hiveconf:n})    =
                                                r.match_val
    
                            where       v.val between r.from_val and r.to_val
                            ) r
    
                on          r.id    =
                            v.id
    
    order by    v.id       
    

    As for the OP request, here is the full query:

    set n=1000;
    
    select      v.id
               ,r.from_val
               ,r.to_val
    
    from                    val     v 
    
                left join  (select      v.id
                                       ,r.from_val
                                       ,r.to_val
    
                            from                val     v 
    
                                        join   (select  r.*
                                                       ,floor(from_val/${hiveconf:n}) + pe.i    as match_val
    
                                                from    val_range r
                                                        lateral view    posexplode
                                                                        (
                                                                            split
                                                                            (
                                                                                space
                                                                                (
                                                                                    cast
                                                                                    (
                                                                                        floor(to_val/${hiveconf:n}) 
                                                                                      - floor(from_val/${hiveconf:n}) 
    
                                                                                        as int
                                                                                    )
                                                                                )
                                                                               ,' '
                                                                            )
                                                                        ) pe as i,x
                                                ) r
    
                                        on      floor(v.val/${hiveconf:n})    =
                                                r.match_val
    
                            where       v.val between r.from_val and r.to_val
                            ) r
    
                on          r.id    =
                            v.id
    
    order by    v.id