
How to measure performance in Pig


I came across two Pig scripts that do the same job: calculating what percentage of the rows each value accounts for.

Script1

-- A is assumed to be loaded already, e.g. with the LOAD statement from Script2
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A BY $0) GENERATE group AS colname, COUNT(A) AS cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;

Script2

test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0, ($1 * 100.0 / $3);  -- 100.0 forces floating-point division; the original cast came after the integer division

At first glance, Script1 looks more efficient than Script2.

I want to know whether there are tools, like VisualVM or JProfiler for Java, to measure the performance of Pig scripts.

The time taken to run the script is one way to measure it, but are there tools built for this?


Solution

    • You have written a Pig script.
    • Pig translates the script into an optimized series of MapReduce (MR) jobs.

    Use the EXPLAIN command to see the MR plan for each script, then compare the plans using some general rules (there can be exceptions):

    1. The script that uses fewer reducers will usually be faster.
    2. The script that generates fewer MR jobs will usually be faster.
    3. Within a given MR job, the script that calls fewer UDFs will usually be faster.
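
As a minimal sketch of that comparison, the statements below run EXPLAIN on the final alias of each script from the question. It assumes that Script1's relation A is loaded from the same test.txt that Script2 uses (the question does not show that LOAD). EXPLAIN prints the logical, physical, and MapReduce plans without running the jobs.

```pig
-- Script1, with an assumed LOAD for A (not shown in the question)
A = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray, two:int);
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A BY $0) GENERATE group AS colname, COUNT(A) AS cnt;
fractions = FOREACH rows GENERATE colname, cnt / (double) total.$0;

-- Print the plans for Script1's final relation
EXPLAIN fractions;

-- Script2, as in the question, with the division done in floating point
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray, two:int);
B = GROUP test BY $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group, COUNT(test.$0);
F = CROSS C, E;
G = FOREACH F GENERATE $0, ($1 * 100.0 / $3);

-- Print the plans for Script2's final relation
EXPLAIN G;
```

In the MapReduce section of each output, count the jobs and look at which operators land in the reduce phase; the CROSS in Script2 is the step most likely to add work. When a script is actually run, Pig also prints a job statistics summary at the end, which gives per-job timings to set against the plans.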