I am new to Pig and am looking for help. I have two files, A (templates) and B (data), both with a large amount of unstructured content. The goal is to traverse file B (data) and find the count for each template (line) of file A.
I think it should work as a loop with a nested statement, but I do not know how to achieve this in Pig.
example:-
file1.txt
hello ravi
hi mohit
bye sameer
hi mohit
hi abc
hello cds
hi assaad
file2.txt
hi mohit
hi assaad
I need a count for each of the two lines in file2. The expected output may look like this:
hi mohit: 2
hi assaad: 1
Please do let me know.
Let's start by loading both of your datasets:
data = LOAD 'file1.txt' AS (line:chararray);
templates = LOAD 'file2.txt' AS (template:chararray);
Now we essentially need to JOIN the above relations on the template. Once joined, we can GROUP on the template to get the count for each template. However, that would require two map-reduce stages, one for the JOIN and one for the GROUP BY. This is where you can use COGROUP. It is an extremely useful operation, and you can read more about it here: https://www.tutorialspoint.com/apache_pig/apache_pig_cogroup_operator.htm
cogroupedData = COGROUP data BY line, templates BY template;
templateLines = FILTER cogroupedData BY (NOT IsEmpty(templates));
templateCounts = FOREACH templateLines GENERATE
group AS template,
COUNT(data.line) AS templateCount;
DUMP templateCounts;
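What COGROUP does is essentially a JOIN followed by a GROUP BY on the same key (the template in this case), but it takes only one map-reduce stage. For each distinct key it produces one record holding a bag of matching tuples from data and a bag of matching tuples from templates. The filter applied above removes records whose key does not appear as a template in file2.txt, that is, those where the templates bag is empty.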
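For comparison, the two-stage alternative mentioned above (a JOIN followed by a GROUP BY) could be sketched roughly as follows. This is only an illustration using the same file names and schemas as above; the relation names joined, grouped and joinCounts are made up for this sketch.

```pig
data = LOAD 'file1.txt' AS (line:chararray);
templates = LOAD 'file2.txt' AS (template:chararray);

-- Stage 1: inner join keeps only the data lines that exactly match a template.
joined = JOIN templates BY template, data BY line;

-- Stage 2: group the joined rows by template and count the rows in each group.
grouped = GROUP joined BY templates::template;
joinCounts = FOREACH grouped GENERATE
    group AS template,
    COUNT(joined) AS templateCount;

DUMP joinCounts;
```

One behavioural difference to keep in mind: because this uses an inner join, a template that never occurs in file1.txt does not appear in the output at all, whereas the COGROUP version reports it with a count of 0.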