Search code examples
hadoophiveclouderahiveqlimpala

Hive Query Language return only values where NOT LIKE a value in another table


I'm trying find all the values in my hosts table, which do not contain partial match to values in my maildomains table.

hosts
+-------------------+-------+
|       host        | score |
+-------------------+-------+
| www.gmail.com     |   489 |
| www.hotmail.com   |   653 |
| www.google.com    |   411 |
| w3.hotmail.ca     |   223 |
| stackexchange.com |   950 |
+-------------------+-------+
maildomains 
+---------------+
| email         |
+---------------+
| gmail         |
| hotmail       |
| outlook       |
| mail          |
+---------------+

Specifically, I am looking to do SELECT * of hosts where the hosts.host NOT LIKE any value in '%.maildomains.email%'

Desired output:
+-------------------+-------+
|       host        | score |
+-------------------+-------+
| www.google.com    |   411 |
| stackexchange.com |   950 |
+-------------------+-------+

Here's how I think it should work logically:

SELECT h.*, m.email FROM (SELECT h.* FROM hosts WHERE score > 100 as h)
h LEFT OUTER JOIN maildomains m ON (h.host LIKE CONCAT('%.',m.email,'%'))
WHERE m.email IS NULL

This results in error 10017: both left and right aliases encountered in join ''%''

I also managed to get a similar query to run without error as CROSS JOIN, but it yields bad results:

SELECT h.*, m.email FROM (SELECT h.* FROM hosts WHERE score > 100 as h)
h CROSS JOIN maildomains m 
WHERE h.host NOT LIKE CONCAT('%.',m.email,'%')

+-------------------+---------+---------+
|      p.host       | p.score | m.email |
+-------------------+---------+---------+
| www.gmail.com     |     489 | hotmail |
| www.gmail.com     |     489 | outlook |
| www.gmail.com     |     489 | mail    |
| www.hotmail.com   |     653 | gmail   |
| www.hotmail.com   |     653 | outlook |
| www.hotmail.com   |     653 | mail    |
| www.google.com    |     411 | gmail   |
| www.google.com    |     411 | hotmail |
| www.google.com    |     411 | outlook |
| www.google.com    |     411 | mail    |
| w3.hotmail.ca     |     223 | gmail   |
| w3.hotmail.ca     |     223 | outlook |
| w3.hotmail.ca     |     223 | mail    |
| stackexchange.com |     950 | gmail   |
| stackexchange.com |     950 | hotmail |
| stackexchange.com |     950 | outlook |
| stackexchange.com |     950 | mail    |
+-------------------+---------+---------+

I appreciate any and all guidance.


Solution

  • If your Hive version is 0.13 or newer, than you could use a subquery in the WHERE clause to filter the rows from the hosts table. The following is a more generalized approach that would not require you to enumerate all of the top-level domains you might find in your data:

    SELECT host, score
    FROM hosts
    WHERE
      regexp_extract(hosts.host, "(?:.*?\\.)?([^.]+)\\.[^.]+", 1) NOT IN
        (SELECT email FROM maildomains);
    

    This approach isolates the portion of the host domain just before the TLD with the regexp_extract and then checks to see if that domain name occurs in the subquery on the maildomains table.