I have a table with 1000 rows and 3 variables (an ID, a COUNTRY and a string variable "VAR1"). VAR1 is a sentence composed by words separated by a space.
I want, by COUNTRY, a count of all couples (or all triplets) of words. Very important the couples (or triplets) are a cross-over of all words (not necessarily step by step). Perhaps, we can do it with an ngrams hql-function but when i use it, it's counting words step by step and not all the crossing.
Let's go for the example to give you an idea of what i want :
> **"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA" "bird lion car"
VAR1 is not necessarily a length of 3 words. It's just to simplify.
The resutls in 4 steps that i want for a 2-ngrams:
STEP 1 : THE MOST IMPORTANT STEP : crossing words by 2
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "lion bird" 1
> "3" "USA" "lion car" 1
> "3" "USA" "car bird" 1
> "3" "USA" "car lion" 1
STEP 2 : order the 2-grams
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "car lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "car lion" 1
STEP 3 : distinct by ID, COUNTRY, 2-ngrams
> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA" "bird lion"
> "3" "USA" "bird car"
> "3" "USA" "car lion"
STEP 4 : count by COUNTRY, 2-ngrams
> "CANADA" "dad mum" 2
> "CANADA" "child dad" 1
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1
> "CANADA" "dog mum" 1
> "USA" "bird lion" 1
> "USA" "bird car" 1
> "USA" "car lion" 1
THANK YOU VERY MUCH
with cte as
(
select t.ID
,t.COUNTRY
,pe.pos
,pe.val
from mytable t
lateral view posexplode (split(VAR1,'\\s+')) pe
)
select t1.COUNTRY
,concat_ws(' ',t1.val,t2.val) as combination
,count (*) as cnt
from cte t1
join cte t2
on t2.id =
t1.id
where t1.pos < t2.pos
group by t1.COUNTRY
,t1.val
,t2.val
;
+----------+--------------+------+
| country | combination | cnt |
+----------+--------------+------+
| CANADA | dad child | 1 |
| CANADA | dad dog | 1 |
| CANADA | dad mum | 2 |
| CANADA | mum child | 1 |
| CANADA | mum dog | 1 |
| USA | bird car | 1 |
| USA | bird lion | 1 |
| USA | lion car | 1 |
+----------+--------------+------+