
Pig UDF efficiency versus cascaded built-in functions


I am new to Pig scripting. I have a requirement to implement an if/else ladder with up to 10 conditions. As far as I know, Pig only offers the ternary operator, so I was thinking of writing a UDF instead of cascading the ternary operator like this: ( condition1 ? statement1 : ( condition2 ? statement2 : statement3 ) )

The data size is in the tens of millions of rows. Should I even put effort into creating a UDF for this requirement?

If it ends up causing performance problems, there will be no point in making the effort.

From what I know, the UDF will be called once for each row under consideration, and a repeated call over millions of records is a serious overhead.


Solution

  • I think that if you have access to a big cluster, the UDF shouldn't be a problem, and it will improve the readability of your script. In the end, your Pig script is itself compiled into jobs that run as Java code, so the built-in operators are also evaluated row by row. The biggest performance win comes from filtering your data before the expensive operations.
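As a rough illustration of why the per-row cost is small, here is a minimal sketch of the kind of ladder logic such a UDF would hold. The class name `LadderSketch`, the method `bucket`, and the score thresholds are all hypothetical, since the question does not state the actual conditions. In a real Pig UDF this method would be called from the `exec(Tuple)` method of a class extending `org.apache.pig.EvalFunc`; it is kept as plain Java here so it runs without the Pig jar.

```java
// Hypothetical sketch: the conditions and labels below are assumptions,
// not the asker's real rules.
class LadderSketch {

    // A plain if/else ladder. Each call is a handful of comparisons with
    // no recursion and no allocation, so the per-row cost is small.
    static String bucket(Integer score) {
        if (score == null) {
            return null; // Pig fields can be null; pass nulls through
        }
        if (score >= 90) {
            return "A";
        } else if (score >= 80) {
            return "B";
        } else if (score >= 70) {
            return "C";
        } else if (score >= 60) {
            return "D";
        } else {
            return "F";
        }
    }

    public static void main(String[] args) {
        // Stand-in for the rows Pig would feed to the UDF one at a time.
        int[] rows = {95, 72, 10};
        for (int score : rows) {
            System.out.println(score + " -> " + bucket(score));
        }
    }
}
```

The ladder does the same comparisons a chain of nested ternaries would do, so the choice between the two is mostly about readability, not about the amount of work per row.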