Please see: Find every Unique 2 word pairings from a column of terms in Google Sheets
In the above question the asker was was looking for unique 2 word pairings and gave the example of having 3 words in a cell.
The formula given works if there is only 1 to 3 words in a cell, but as soon as you go over (4+ words) the formula returns a #value Error "in ARRAY_LITERAL, an Array Literal was missing values for one or more rows"
=arrayformula(query(flatten({query({if(A:A<>"",iferror(trim(split(regexreplace(regexreplace(regexreplace(lower(A:A),"[^A-Za-z\ \']+",""),"([\w\']+\ [\w\']+)","$1,"),"(.*\,)(.*)","$1"),",")),),)},"where Col1 is not null",0);query({if(A:A<>"",iferror(trim(split(regexreplace(regexreplace(regexreplace(lower(if(len(A:A)=len(regexreplace(A:A," ",""))+1,,A:A)),"[^A-Za-z\ \']+",""),"(\w*)(\ )([\w\']+\ [\w\']+)","$3,"),"(.*\,)(.*)","$1"),",")),),)},"where Col1 is not null",0)}),"select Col1,count(Col1) where Col1 is not null group by Col1 label Col1 'Word pairs',Count(Col1) 'Qty' ",0))
Is it possible to make the formula work with cells that include 25, 50, 100+ words and then be able to exclude common words from a column like done in Extracting and counting unique word frequency from a range
=ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b' group by Col1 order by Count(Col1) desc label Count(Col1)''"))
I'm trying to find unique word pairings in sentences and then count how many times they appear.
I've edited my question to try to make it more readable (sorry still don't completely now how to use stack-overflow formatting or respond to comments etc.)
TEXT column and EXCLUSIONS is the input
UNIQUE PAIRS and COUNT would be the Output
TEXT | UNIQUE PAIRS | COUNT | EXCLUSIONS |
---|---|---|---|
I wake up with night sweats and struggle with staying asleep. | night sweats | 2 | I wake |
I wake up from night sweats and struggle with dry hair. | staying asleep | 1 | wake up |
dry hair | 1 | up with | |
with night | |||
sweats and | |||
and struggle | |||
with staying | |||
struggle with | |||
up from | |||
from night | |||
with dry |
Purpose is to be able to find common word pairings like
bed head, dry hair, night sweats, etc.
That you can't find just by looking at the frequency of single words
try:
=INDEX(LAMBDA(a, LAMBDA(x, QUERY(QUERY(x&" "&{QUERY(x,
"offset 1", ); ""}, "where not Col1 contains '×××'", ),
"select Col1,count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''"))
(FLATTEN(SPLIT(QUERY(FLATTEN({REGEXREPLACE(LOWER(a), "[\.,\?!\(\)]", ),
IFERROR(a/0, "×××")}),,9^9), " "))))(A1:A2))
=INDEX(LAMBDA(a, LAMBDA(x, QUERY(QUERY(x&" "&{QUERY(x,
"offset 1", ); ""}, "where not Col1 contains '×××'", ),
"select Col1,count(Col1)
where not Col1 matches '"&TEXTJOIN("|", 1, UNIQUE(LOWER(E2:E)))&"'
group by Col1 order by count(Col1) desc label count(Col1)''"))
(FLATTEN(SPLIT(QUERY(FLATTEN({REGEXREPLACE(LOWER(a), "[\.,\?!\(\)]", ),
IFERROR(a/0, "×××")}),,9^9), " "))))(A1:A2))