I have the following data in a file:
USA USA EUROPE EUROPE EUROPE EUROPE USA
USA USA EUROPE EUROPE USA
EUROPE USA
I'm trying to count the occurrences of USA and EUROPE.
1) inp = LOAD '/user/countries.txt' as (singleline);
dump inp;
Output
(USA USA EUROPE EUROPE EUROPE EUROPE USA)
(USA USA EUROPE EUROPE USA)
(EUROPE USA)
Is each line of this output a tuple?
2) tknz = FOREACH inp GENERATE TOKENIZE(singleline) as Col_Words;
dump tknz;
Output
{(USA),(USA),(EUROPE),(EUROPE),(EUROPE),(EUROPE),(USA)}
{(USA),(USA),(EUROPE),(EUROPE),(USA)}
{(EUROPE),(USA)}
How does this output match the TOKENIZE definition?
The definition says "split a string of words (all words in a single tuple)" INTO "a bag of words (each word in a single tuple)".
The "into a bag of words" part matches my output, but I can't see what "split a string of words (all words in a single tuple)" refers to when I compare the definition with my output.
Where are all the words in a single tuple?
The TOKENIZE definition: "Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote("), coma(,) parenthesis(()), star(*)."
Any help...?
You need to use FLATTEN with TOKENIZE to unnest bags/tuples.
tknz = FOREACH inp GENERATE FLATTEN(TOKENIZE(singleline)) as Col_Words;
tknz_group = GROUP tknz ALL;
tknz_count = FOREACH tknz_group GENERATE group, COUNT(tknz.Col_Words);
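Note that GROUP ... ALL puts every word into a single group, so the COUNT above yields one total for the whole file. If the goal is a separate count for USA and for EUROPE, grouping by the word itself is one way to get it. The following is a minimal sketch, assuming the file from the question; the alias names tknz_by_word and word_count and the field names word and occurrences are made up for illustration.

```pig
-- Load each line of the file as one chararray field (path taken from the question).
inp = LOAD '/user/countries.txt' AS (singleline:chararray);

-- TOKENIZE returns one bag per input line; FLATTEN un-nests the bag
-- so that every word becomes its own row.
tknz = FOREACH inp GENERATE FLATTEN(TOKENIZE(singleline)) AS Col_Words;

-- Group identical words together (illustrative alias name).
tknz_by_word = GROUP tknz BY Col_Words;

-- Count the rows in each group: one output tuple per distinct word.
word_count = FOREACH tknz_by_word GENERATE group AS word, COUNT(tknz) AS occurrences;

DUMP word_count;
```

For the sample data in the question this should print one tuple per word, (EUROPE,7) and (USA,7), though the order of the tuples is not guaranteed.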