Search code examples
sql-serverword-frequency

SQL Server Word Frequency for Each ID


I have a SQL Server table with an ID and a text column with size of ~ 15,000 words. For each ID, I want to return a list of every unique word and the number of times it occurred.

CREATE TABLE MyComments (
Textid char(20) NOT NULL,
Comments VARCHAR(MAX)
)

Answer would look like
    TextID  Word     WordCount
    ------- ------   ----------
    1        cost     5
    1        expense  7
    2        cost     12
    2        revenue  11
 

..

Solution

  • This is not really something SQL Server is designed to do; it will be extremely expensive to do this for a table that is large (large number of rows, extremely long text like 15K, or both). If your table is small:

    SELECT c.Textid, Word = s.value, WordCount = COUNT(*)
      FROM dbo.MyComments AS c
      CROSS APPLY STRING_SPLIT(c.Comments, ' ') AS s
      GROUP BY c.Textid, s.value
      ORDER BY Textid;
    

    At scale, this query will be terrible.

    As @lptr pointed out, likely better at large scale to rearrange this query in a more complicated way so that the splitting is separated from the rest of the logic:

    SELECT c.Textid, Word = s.value, s.WordCount
      FROM dbo.MyComments AS c
      CROSS APPLY
      (
        SELECT value = CONVERT(varchar(255), value),
               WordCount = COUNT(*)
          FROM STRING_SPLIT(c.Comments, ' ')
          WHERE LEN(value) > 0
          GROUP BY CONVERT(varchar(255), value)
      ) AS s
      ORDER BY c.textid;
    

    But this is still pretty slow (for an 8,000-row table it did almost 600,000 reads and took ~15 seconds). I don't believe SQL Server is the place to do this work.