Search code examples
sqlamazon-web-servicesamazon-athenaprestotrino

like query count does not match


If the total number of records is x and the count of "like" query is y then the count of "not like" query should be x - y

I am getting "x" as the count of "not like" query as shown below:

SELECT COUNT(DISTINCT(b.word)) 
FROM "hunspell"."oscar2_sorted" AS b

Total Count: 9597651

SELECT COUNT(distinct(b.word))
FROM "hunspell"."oscar2_sorted" as b 
INNER JOIN invalidswar AS a 
ON b.word LIKE (CONCAT('%', a.word,'%'))

Like count: 73116

SELECT COUNT(distinct(b.word)) 
FROM "hunspell"."oscar2_sorted" AS b 
INNER JOIN invalidswar AS a
ON b.word NOT LIKE (CONCAT('%', a.word,'%'))

Not Like count: 9597651

Expected: 9524535

I am not sure what am I missing.


Update:

Left join count is close to expected, but still does not look correct.

SELECT COUNT(DISTINCT(b.word))
FROM "hunspell"."oscar2_sorted" AS b 
LEFT JOIN (SELECT DISTINCT(b.word) AS dword 
           FROM "hunspell"."oscar2_sorted" AS b 
           INNER JOIN invalidswar AS a 
           ON b.word LIKE (CONCAT('%', a.word,'%'))) AS d 
ON d.dword = b.word 
WHERE d.dword IS NULL

left join count: 9536539


update 2:

a difference of 12004 is traced back to the difference of how like and regexp_like is executed.

SELECT count(distinct(b.word)) 
          FROM "hunspell"."oscar2_sorted" as b
         INNER JOIN invalidswar AS a 
            ON regexp_like(b.word, a.word)

regex like count: 61112


Solution

  • SELECT COUNT(word)
      FROM (SELECT word 
              FROM "hunspell"."oscar2_sorted"
            EXCEPT DISTINCT
           (SELECT b.word
              FROM "hunspell"."oscar2_sorted" as b
             INNER JOIN invalidswar AS a 
                ON regexp_like(b.word, a.word)))
    

    See: EXCEPT clause and regexp_like