For my problem, I'm trying to find the category with the highest count of words within various sentences:
Table 1:
Word | Category |
---|---|
'Rain' | A |
'He' | A |
'France' | C |
... | ... |
'Work' | B |
Table 2:
ID | Sentence |
---|---|
1 | 'I'd like to go to France sometime.' |
2 | 'He won't come in a rain like this.' |
3 | 'They agreed to work together.' |
... | ... |
Ideally I would like to create a table like this:
Result:
ID | Sentence | Category with highest word count |
---|---|---|
1 | 'I'd like to go to France sometime.' | C |
2 | 'He won't come in a rain like this.' | A |
3 | 'They agreed to work together.' | B |
... | ... | ... |
My plan was to create columns for each word and then aggregate them by their categories to find the category with the highest word count for each ID, but the code I am planning to use requires a lot of manual input. Is there a better way to do this?
select ID,
Sentence
regexp_count(Assignee, 'Rain') + regexp_count(Assignee, 'He') + ... as CategoryA_Count
... as CategoryB_count
... as CategoryC_count
from Table 2```
This could work:
WITH
-- your input ...
words(word,category) AS (
SELECT 'Rain' ,'A'
UNION ALL SELECT 'He' ,'A'
UNION ALL SELECT 'France','C'
UNION ALL SELECT 'Work' ,'B'
)
,
-- your other input ...
sentences(id,sentence) AS (
SELECT 1,'I''d like to go to France sometime.'
UNION ALL SELECT 2,'He won''t come in a rain like this.'
UNION ALL SELECT 3,'They agreed to work together.'
)
-- here goes the query ...
-- Each time a word out of "words" is found in "sentences", that
-- counts once.
SELECT
id
, sentence
, category
FROM words
-- "Found" can be expressed as a join of "words" and
-- "sentences" that succeeds if that word is found in a case in-
-- sensitive match ( REGEXP_ILIKE() ) as a word match: "\b" means:
-- word boundary, hence we match with '\b'||word||'\b', so, for
-- example: '\bHe\b', and not just 'He'. REGEXP_ILIKE() returns a
-- Boolean: TRUE if a match is found.
JOIN sentences ON REGEXP_ILIKE(sentence,'\b'||word||'\b')
GROUP BY
id
, sentence
, category
-- we count the rows produced by the complex join
-- by id, sentence and category.
-- we use Vertica's analytic limit clause that returns just the
-- first LIMIT <n> rows per partition in the PARTITION BY clause
-- to get the highest COUNT(*), we ORDER BY that value DESC(ending)
LIMIT 1 OVER(PARTITION BY sentence ORDER BY COUNT(*) DESC)
-- out id | sentence | category
-- out ----+------------------------------------+----------
-- out 1 | I'd like to go to France sometime. | C
-- out 2 | He won't come in a rain like this. | A
-- out 3 | They agreed to work together. | B