Search code examples
sqlhadoophiveapache-pigimpala

Join table by string matching in Hive or Impala or Pig


I have two tables A and B, where B is huge (20 million by 300) and A is of moderate size (300k by 10). A contains one column that is address and B contains 3 columns that can be put together to form a proper street address. For example, in A, the address column could be:

id  | Address
-----------
233 | 123 Main St

and in B we could have:

Number  |  Street_name  | Street_suffix | Tax
------------------------------------------------
123     |  Main         | Street        | 320.2

I want to join them using string matching similar to LIKE with something like the following:

select A.id, B.Tax
from A
  left join B
    on A.Address **contains** B.Number
    and A.Address **contains** B.Street_name;

Basically I am trying to match the records by saying that if A's address contains B's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).

The 2 contains part is something I am not sure how to implement. Any ideas?

I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).


Solution

  • You can use join with equality conditions only, but -
    You can cross join and filter.

    select      A.id, B.Tax
    from        A cross join B
    where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
            and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
    ;
    

    Demo

    hive> create table A (id int,Address string);
    OK
    hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
    OK
    hive> insert into A values (233,'123 Main St');
    Query ID = ...
    OK
    hive> insert into B values (123,'Main','Street',320.2);
    Query ID = ...
    OK
    hive> select      A.id, B.Tax
        > from        A cross join B
        > where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
        >         and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
        > ;
    Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
    Query ID = ...
    OK
    233 320.2
    hive>