Search code examples
regexcountgoogle-sheetsformulasarray-formulas

Efficient cell formula counting of substrings in a range of cells


I have a Google spreadsheet with two columns: A and C. Column A has over a thousand rows of names/subjects in their own cells, and Column C has several hundred rows with a variable combination of the above names/subjects in its various cells, with each name/subject separated from the others by a semicolon. Each of these columns gains new entries frequently throughout the course of the day.

As a very simplified visual example of the setup:

**Column A: Names  | Column B: Occurrences | Column C: List**

A2: Adam           | B2: [Blank]           | C2: Charles; Adam

A3: Bob            | B3: [Blank]           | C3: Adam

A4: Charles        | B4: [Blank]           | C4: Smith, Charles

A5: Smith, Charles | B5: [Blank]           | C5: Bob Evans

A6: Bob Evans      | B6: [Blank]           | C6: Smith, Charles; Charles; Bob

A7: [etc.]         | B7: [Blank]           | C7: Bob Evans; Charles; Bob

A8: [etc.]         | B8: [Blank]           | C8: [etc.] 

Currently, I’m using the following formula to count the number of times that each string from Column A (here, A2) appears as a substring in Column C (here, C2 through C7):

=ARRAYFORMULA(IF(A2="","",(SUMPRODUCT(REGEXMATCH(REGEXREPLACE(REGEXREPLACE($C$2:$C$7,"([\(\)\?])", ""),"(\w+),{0,1}\s+(\w+)","$1$2"),".*(^|\s)"& trim(REGEXREPLACE(REGEXREPLACE($A2,"([\(\)\?])", ""),"(\w+),{0,1}\s+(\w+)","$1$2"))&"(;|$).*")))))

This gives the correct totals, but it seems to be incredibly processing heavy once you scale up; changing or adding any one entry to Column C causes the entire sheet to recalculate its thousands of entries, and it takes several minutes for it to turn out the new totals. Many of the REGEXREPLACE values are used here because some of the entries have punctuation like “()” and “?”, on account of cells like “Erōs”, “Olympic Games (23rd : 1984 : Los Angeles, Calif.)”, and “Thomas, Aquinas, Saint, 1225?-1274”.

The closest counting alternative that I’ve come up with is the following:

=SUMPRODUCT((LEN(C$2:C$7)-LEN(SUBSTITUTE(C$2:C$7,A2,"")))/LEN(A2))

Testing shows that this much simpler formula can recalculate the whole sheet in a few seconds, but it doesn’t actually count the entries correctly. In the above example of C2-C7, it would give totals of 4 and 5 for “Bob” and “Charles” because it doesn’t distinguish between “Bob” and “Bob Evans” or “Charles” and “Smith, Charles.” It should properly find 2 and 3 respectively.

Is there an effective way to adjust the above formula or create a new one which will count all the substrings properly—limiting the sums to exact matches to Column A as they're found between semicolons in Column C—without causing the sheet’s calculations to freeze up for several minutes at a time? Regular expressions were the route I first ended up taking, but I expect that those operations are the reason that it takes so long.


Solution

  • Try
    B2:

    =SUMPRODUCT(TRIM(A2)=TRIM(SPLIT($C$2:$C$8,";")))