Search code examples
word-countclickhouse

clickhouse approach for word frequency count on textual field


I have a Clickhouse table, where one of the fields contains a textual description (~300 words).

For example Reviews:

Rev_id    Place_id    Stars    Category    Text
1         12           3        Food       Nice food but a bad dirty place.
2         31           4        Sport      Not bad, they have everything.
3         55           1        Bar        Poor place,bad audience.

I'd like to make some word count analysis, such as general word frequency count (how many times each of the words has appeared) or top-K words per Category.

In the example:

word    count
bad     3
place   2

... Is there a way to do it solely in ClickHouse without involving programming languages?


Solution

  • SELECT
        arrayJoin(splitByChar(' ', replaceRegexpAll(x, '[.,]', ' '))) AS w,
        count()
    FROM
    (
        SELECT 'Nice food but a bad dirty place.' AS x
        UNION ALL
        SELECT 'Not bad, they have everything.'
        UNION ALL
        SELECT 'Poor place,bad audience.'
    )
    GROUP BY w
    ORDER BY count() DESC
    
    ┌─w──────────┬─count()─┐
    │            │       4 │
    │ bad        │       3 │
    │ place      │       2 │
    │ have       │       1 │
    │ Poor       │       1 │
    │ food       │       1 │
    │ Not        │       1 │
    │ they       │       1 │
    │ audience   │       1 │
    │ Nice       │       1 │
    │ but        │       1 │
    │ dirty      │       1 │
    │ a          │       1 │
    │ everything │       1 │
    └────────────┴─────────┘
    
    
    SELECT CATEGORY, ....
    GROUP BY CATEGORY, w