I'm trying find all the values in my hosts table, which do not contain partial match to values in my maildomains table.
hosts
+-------------------+-------+
| host | score |
+-------------------+-------+
| www.gmail.com | 489 |
| www.hotmail.com | 653 |
| www.google.com | 411 |
| w3.hotmail.ca | 223 |
| stackexchange.com | 950 |
+-------------------+-------+
maildomains
+---------------+
| email |
+---------------+
| gmail |
| hotmail |
| outlook |
| mail |
+---------------+
Specifically, I am looking to do SELECT * of hosts where the hosts.host NOT LIKE any value in '%.maildomains.email%'
Desired output:
+-------------------+-------+
| host | score |
+-------------------+-------+
| www.google.com | 411 |
| stackexchange.com | 950 |
+-------------------+-------+
Here's how I think it should work logically:
SELECT h.*, m.email FROM (SELECT h.* FROM hosts WHERE score > 100 as h)
h LEFT OUTER JOIN maildomains m ON (h.host LIKE CONCAT('%.',m.email,'%'))
WHERE m.email IS NULL
This results in error 10017: both left and right aliases encountered in join ''%''
I also managed to get a similar query to run without error as CROSS JOIN, but it yields bad results:
SELECT h.*, m.email FROM (SELECT h.* FROM hosts WHERE score > 100 as h)
h CROSS JOIN maildomains m
WHERE h.host NOT LIKE CONCAT('%.',m.email,'%')
+-------------------+---------+---------+
| p.host | p.score | m.email |
+-------------------+---------+---------+
| www.gmail.com | 489 | hotmail |
| www.gmail.com | 489 | outlook |
| www.gmail.com | 489 | mail |
| www.hotmail.com | 653 | gmail |
| www.hotmail.com | 653 | outlook |
| www.hotmail.com | 653 | mail |
| www.google.com | 411 | gmail |
| www.google.com | 411 | hotmail |
| www.google.com | 411 | outlook |
| www.google.com | 411 | mail |
| w3.hotmail.ca | 223 | gmail |
| w3.hotmail.ca | 223 | outlook |
| w3.hotmail.ca | 223 | mail |
| stackexchange.com | 950 | gmail |
| stackexchange.com | 950 | hotmail |
| stackexchange.com | 950 | outlook |
| stackexchange.com | 950 | mail |
+-------------------+---------+---------+
I appreciate any and all guidance.
If your Hive version is 0.13 or newer, than you could use a subquery in the WHERE
clause to filter the rows from the hosts
table. The following is a more generalized approach that would not require you to enumerate all of the top-level domains you might find in your data:
SELECT host, score
FROM hosts
WHERE
regexp_extract(hosts.host, "(?:.*?\\.)?([^.]+)\\.[^.]+", 1) NOT IN
(SELECT email FROM maildomains);
This approach isolates the portion of the host domain just before the TLD with the regexp_extract
and then checks to see if that domain name occurs in the subquery on the maildomains
table.