Search code examples
google-sheetsgoogle-sheets-formulatransposearray-formulasgoogle-sheets-query

Find frequency of words in a column in Google Sheets and lookup another value from a different column using formulae


I have 2 columns of data in a Google Sheet. Column1 is unique words or sentences (words are repeated in sentences) and the Column2 is a numeric value next to each (say votes). I am trying to get a list of unique words from Column1 and then the sum of values (votes) from Column2 when the word was present either on its own or in a sentence.

Here is a sample of the data I am working with in Google Sheets:

Term           Votes
apple          20
apple eat      100
orange         30
orange rules   40
rule why       50

This is what the end result looks like:

Word    Votes
apple   120
eat     100
orange  70
rules   40
rule    50
why     50

The way I am doing it now is quite long and I am not sure if this is the best solution.

Here's my solution:

  1. JOIN values in Column1 using a delimiter " " and then SPLIT them using the same delimiter and then TRANSPOSE them into a column all in one step. This way I have a list of all the words used in Column1 in say Column3.
  2. In Column4 pull out all the UNIQUE values and then do a COUNTIF for the unique values from Column3. This way I am able to get the frequency of each unique word by referencing to the lsit of all words.
  3. In order to get the sum of Votes I have to TRANSPOSE Column4 and then QUERY Column1 and Column2 by using dynamic text in the formula. The formula looks like =QUERY(Column1:Column2, "SELECT SUM(Column2) WHERE Column1 CONTAINS '" & referenceToUniqueWord & "'", 1). The reason I have to transpose first is because the query formula outputs 2 cells of data ie Text: sumColumn1 and Number: 'sum of votes'. Since for one cell of unique word I get two cells of data I am not able to drag the formula down and hence I have to do it horizontally.
  4. I finally get three rows of data after the last step: One row is just transposed Column4 (all the unique words). Second row is just the text sumColumn2 from using the QUERY formula. And third row is the actual sum of votes, resulting from individual QUERY formulae. I then transpose these rows to columns and to get my final table I VLOOKUP the frequency values arrived at earlier.

This approach is lengthy and prone to errors. Also doesn't work if the list is large and in the initial JOIN I get an error of limit 50,000 reached. Any ideas to make it better are welcome. I know this can be done much easier using Scripts but I'd prefer to have it done using only formulae.


Solution

  • try:

    =ARRAYFORMULA(QUERY(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(
     IF(IFERROR(SPLIT(A:A, " "))="",,"♠"&SPLIT(A:A, " ")&"♦"&B:B)
     ,,999^99)),,999^99), "♠")), "♦"), 
     "select Col1,sum(Col2) 
      group by Col1
      order by sum(Col2) desc 
      label sum(Col2)''"))
    

    0