Search code examples
sqlpreston-gramcross-join

unigrams in SQL


I have the following table (tbl)

id | text
1  | example text
2  | this is an example text
3  | text text text

and I would like to return this table as output (unigrams)

ngram   | counts | n_ids
text    |  5     | 3
example |  2     | 2
this    |  1     | 1
is      |  1     | 1
an      |  1     | 1

And I thought of solving this using a cross join (I'm on Presto).

WITH 
  ngram_array AS (
  SELECT id, ngrams(split(text, ' ')) ngram_array FROM tbl
 ),
SELECT
  array_join(ngram, ' ') ngram,
  count(*) as counts,
  count(id) as n_ids
FROM ngram_array CROSS JOIN UNNEST (ngram_array) AS t(ngram)
GROUP BY ngram

This seems to be giving me ngrams but the columns counts and n_ids have the same value while I expect a difference because once is the count of the ngram over the entire sample and the second one is the number of documents each ngram is present.

Do you know what I might be doing wrong and is there a fiddle where I can test this online (I know fiddles for Postgres but can't find one for Presto).


Solution

  • You can split text as needed into string array,unnest it and use distinct option for count for ids in group by:

    -- sample data
    WITH dataset (id, text) AS (
        VALUES (1,   'example text'),
            (2,   'this is an example text'),
            (3,   'text text text')
    )
    
    --query
    SELECT word,
        count(*) counts,
        count(distinct id) n_ids -- count distinct ids
    FROM (
            SELECT id,
                word
            FROM dataset
                CROSS JOIN UNNEST (split(text, ' ')) as t(word)
        )
    GROUP BY word
    ORDER BY counts desc -- order for output
    

    Output:

    word counts n_ids
    text 5 3
    example 2 2
    this 1 1
    is 1 1
    an 1 1