I have the following table (tbl
)
id | text
1 | example text
2 | this is an example text
3 | text text text
and I would like to return this table as output (unigrams)
ngram | counts | n_ids
text | 5 | 3
example | 2 | 2
this | 1 | 1
is | 1 | 1
an | 1 | 1
And I thought of solving this using a cross join (I'm on Presto).
WITH
ngram_array AS (
SELECT id, ngrams(split(text, ' ')) ngram_array FROM tbl
),
SELECT
array_join(ngram, ' ') ngram,
count(*) as counts,
count(id) as n_ids
FROM ngram_array CROSS JOIN UNNEST (ngram_array) AS t(ngram)
GROUP BY ngram
This seems to be giving me ngrams but the columns counts
and n_ids
have the same value while I expect a difference because once is the count of the ngram over the entire sample and the second one is the number of documents each ngram is present.
Do you know what I might be doing wrong and is there a fiddle where I can test this online (I know fiddles for Postgres but can't find one for Presto).
You can split text as needed into string array,unnest
it and use distinct
option for count
for ids in group by
:
-- sample data
WITH dataset (id, text) AS (
VALUES (1, 'example text'),
(2, 'this is an example text'),
(3, 'text text text')
)
--query
SELECT word,
count(*) counts,
count(distinct id) n_ids -- count distinct ids
FROM (
SELECT id,
word
FROM dataset
CROSS JOIN UNNEST (split(text, ' ')) as t(word)
)
GROUP BY word
ORDER BY counts desc -- order for output
Output:
word | counts | n_ids |
---|---|---|
text | 5 | 3 |
example | 2 | 2 |
this | 1 | 1 |
is | 1 | 1 |
an | 1 | 1 |