Search code examples
hivewords

ngrams combination of words hive


I have a table with 1000 rows and 3 variables (an ID, a COUNTRY and a string variable "VAR1"). VAR1 is a sentence composed by words separated by a space.

I want, by COUNTRY, a count of all couples (or all triplets) of words. Very important the couples (or triplets) are a cross-over of all words (not necessarily step by step). Perhaps, we can do it with an ngrams hql-function but when i use it, it's counting words step by step and not all the crossing.

Let's go for the example to give you an idea of what i want :

> **"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA"    "bird lion car"

VAR1 is not necessarily a length of 3 words. It's just to simplify.

The resutls in 4 steps that i want for a 2-ngrams:

STEP 1 : THE MOST IMPORTANT STEP : crossing words by 2

> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA"    "bird lion" 1
> "3" "USA"    "bird car" 1
> "3" "USA"    "lion bird" 1
> "3" "USA"    "lion car" 1
> "3" "USA"    "car bird" 1
> "3" "USA"    "car lion" 1 

STEP 2 : order the 2-grams

> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA"    "bird lion" 1
> "3" "USA"    "bird car" 1
> "3" "USA"    "bird lion" 1
> "3" "USA"    "car lion" 1
> "3" "USA"    "bird car" 1
> "3" "USA"    "car lion" 1 

STEP 3 : distinct by ID, COUNTRY, 2-ngrams

> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA"    "bird lion"
> "3" "USA"    "bird car"
> "3" "USA"    "car lion" 

STEP 4 : count by COUNTRY, 2-ngrams

> "CANADA" "dad mum" 2 
> "CANADA" "child dad" 1 
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1 
> "CANADA" "dog mum" 1 
> "USA"    "bird lion" 1 
> "USA"    "bird car" 1 
> "USA"    "car lion" 1

THANK YOU VERY MUCH


Solution

  • with        cte as
                (
                    select  t.ID
                           ,t.COUNTRY
                           ,pe.pos
                           ,pe.val
                    from    mytable t
                            lateral view posexplode (split(VAR1,'\\s+')) pe
                )
    
    select      t1.COUNTRY  
               ,concat_ws(' ',t1.val,t2.val)    as combination
               ,count (*)                       as cnt
    
    from                cte t1
    
                join    cte t2
    
                on      t2.id   =
                        t1.id
    
    where       t1.pos < t2.pos
    
    group by    t1.COUNTRY  
               ,t1.val 
               ,t2.val
    ;
    

    +----------+--------------+------+
    | country  | combination  | cnt  |
    +----------+--------------+------+
    | CANADA   | dad child    | 1    |
    | CANADA   | dad dog      | 1    |
    | CANADA   | dad mum      | 2    |
    | CANADA   | mum child    | 1    |
    | CANADA   | mum dog      | 1    |
    | USA      | bird car     | 1    |
    | USA      | bird lion    | 1    |
    | USA      | lion car     | 1    |
    +----------+--------------+------+