I have two tables A
and B
, where B
is huge (20 million by 300) and A
is of moderate size (300k by 10). A
contains one column that is address and B
contains 3 columns that can be put together to form a proper street address. For example, in A
, the address column could be:
id | Address
-----------
233 | 123 Main St
and in B
we could have:
Number | Street_name | Street_suffix | Tax
------------------------------------------------
123 | Main | Street | 320.2
I want to join them using string matching similar to LIKE
with something like the following:
select A.id, B.Tax
from A
left join B
on A.Address **contains** B.Number
and A.Address **contains** B.Street_name;
Basically I am trying to match the records by saying that if A
's address contains B
's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).
The 2 contains part is something I am not sure how to implement. Any ideas?
I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).
You can use join with equality conditions only, but -
You can cross join and filter.
select A.id, B.Tax
from A cross join B
where concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
;
Demo
hive> create table A (id int,Address string);
OK
hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
OK
hive> insert into A values (233,'123 Main St');
Query ID = ...
OK
hive> insert into B values (123,'Main','Street',320.2);
Query ID = ...
OK
hive> select A.id, B.Tax
> from A cross join B
> where concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
> and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
> ;
Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
Query ID = ...
OK
233 320.2
hive>