Search code examples
sqlvertica

SQL - Check how many times substrings within a column appear in within another column of strings


For my problem, I'm trying to find the category with the highest count of words within various sentences:

Table 1:

Word Category
'Rain' A
'He' A
'France' C
... ...
'Work' B

Table 2:

ID Sentence
1 'I'd like to go to France sometime.'
2 'He won't come in a rain like this.'
3 'They agreed to work together.'
... ...

Ideally I would like to create a table like this:

Result:

ID Sentence Category with highest word count
1 'I'd like to go to France sometime.' C
2 'He won't come in a rain like this.' A
3 'They agreed to work together.' B
... ... ...

My plan was to create columns for each word and then aggregate them by their categories to find the category with the highest word count for each ID, but the code I am planning to use requires a lot of manual input. Is there a better way to do this?

select ID, 
Sentence
regexp_count(Assignee, 'Rain') + regexp_count(Assignee, 'He') + ... as CategoryA_Count
... as CategoryB_count
... as CategoryC_count
from Table 2```

Solution

  • This could work:

    WITH
    -- your input ...
    words(word,category) AS (
              SELECT 'Rain'  ,'A'
    UNION ALL SELECT 'He'    ,'A'
    UNION ALL SELECT 'France','C'
    UNION ALL SELECT 'Work'  ,'B'
    )
    ,
    -- your other input ...
    sentences(id,sentence) AS (
              SELECT 1,'I''d like to go to France sometime.'
    UNION ALL SELECT 2,'He won''t come in a rain like this.'
    UNION ALL SELECT 3,'They agreed to work together.'
    )
    -- here goes the query ...
    -- Each time a word out of "words" is found in "sentences", that
    -- counts once. 
    SELECT
      id  
    , sentence
    , category
    FROM words
    -- "Found" can be expressed as a join of "words" and
    -- "sentences" that succeeds if that word is found in a case in-
    -- sensitive match ( REGEXP_ILIKE() ) as a word match: "\b" means:
    -- word boundary, hence we match with '\b'||word||'\b', so, for 
    -- example: '\bHe\b', and not just 'He'. REGEXP_ILIKE() returns a 
    -- Boolean: TRUE if a match is found.
    JOIN sentences ON REGEXP_ILIKE(sentence,'\b'||word||'\b')
    GROUP BY
      id  
    , sentence
    , category
    -- we count the rows produced by the complex join
    -- by id, sentence and category. 
    -- we use Vertica's analytic limit clause that returns just the
    -- first LIMIT <n> rows per partition in the PARTITION BY clause
    -- to get the highest COUNT(*), we ORDER BY that value DESC(ending)
    LIMIT 1 OVER(PARTITION BY sentence ORDER BY COUNT(*) DESC)
    -- out  id |              sentence              | category 
    -- out ----+------------------------------------+----------
    -- out   1 | I'd like to go to France sometime. | C
    -- out   2 | He won't come in a rain like this. | A
    -- out   3 | They agreed to work together.      | B