hadoop, apache-pig

How does TOKENIZE work in Pig?


I have a file containing the data below:

USA USA EUROPE EUROPE EUROPE EUROPE USA
USA USA EUROPE EUROPE USA
EUROPE USA

I'm trying to count how many times USA and EUROPE occur.

1) inp = LOAD '/user/countries.txt' as (singleline); 
dump inp;

Output  

(USA USA EUROPE EUROPE EUROPE EUROPE USA)
(USA USA EUROPE EUROPE USA)
(EUROPE USA)

Is each line of this output a tuple?

2) tknz = FOREACH inp GENERATE TOKENIZE(singleline) as Col_Words;
dump tknz;

Output

{(USA),(USA),(EUROPE),(EUROPE),(EUROPE),(EUROPE),(USA)}
{(USA),(USA),(EUROPE),(EUROPE),(USA)}
{(EUROPE),(USA)}

How does this output match the TOKENIZE definition?

The definition says it splits "a string of words (all words in a single tuple)" into "a bag of words (each word in a single tuple)".

The "into a bag of words" part of the definition matches my output, but I can't see how "a string of words (all words in a single tuple)" relates to it.

Where are all the words in a single tuple?

The TOKENIZE definition reads: "Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote ("), comma (,), parenthesis (()), star (*)."

Any help...?


Solution

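  • Yes, each line of your first DUMP is a tuple with a single field that holds the whole line; that is the "all words in a single tuple" of the definition. TOKENIZE turns that field into a bag of one-word tuples. To count the words, use FLATTEN with TOKENIZE to unnest the bag so that each word becomes its own record.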

    tknz = FOREACH inp GENERATE FLATTEN(TOKENIZE(singleline)) as Col_Words;
    tknz_group = GROUP tknz BY Col_Words;
    tknz_count = FOREACH tknz_group GENERATE group, COUNT(tknz.Col_Words);
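
    To see where the tuples and bags come from, here is a rough plain-Python model of the steps. It is not Pig code and not how Pig is implemented. It assumes the file holds the three lines implied by the DUMP output in the question, that the separators are the ones listed in the quoted definition, and that the goal is one count per distinct word. The names `tokenize`, `tknz_nested` and `tknz_flat` are made up for illustration.

```python
import re
from collections import Counter

# Assumed sample: the three lines implied by the question's DUMP output.
lines = [
    "USA USA EUROPE EUROPE EUROPE EUROPE USA",
    "USA USA EUROPE EUROPE USA",
    "EUROPE USA",
]

# LOAD ... AS (singleline): every record is a tuple with ONE field,
# the whole line. This is "all words in a single tuple".
inp = [(line,) for line in lines]

# Separators from the quoted definition: space, ", comma, parentheses, *.
SEPARATORS = r'[ ",()*]+'


def tokenize(text):
    """Hypothetical model of TOKENIZE: a string in, a bag of one-word tuples out."""
    return [(word,) for word in re.split(SEPARATORS, text) if word]


# FOREACH inp GENERATE TOKENIZE(singleline): one bag per input record.
tknz_nested = [(tokenize(singleline),) for (singleline,) in inp]

# FLATTEN un-nests each bag, so every word becomes its own record.
tknz_flat = [word_tuple for (bag,) in tknz_nested for word_tuple in bag]

# Per-word counts, i.e. what grouping by the word and applying COUNT would give.
counts = Counter(word for (word,) in tknz_flat)

print(sorted(counts.items()))
```

    With the assumed sample this prints `[('EUROPE', 7), ('USA', 7)]`. Without the flatten step, each record would still hold a whole bag of words, so there would be no single word value to group on.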