I'm having trouble with a Python UDF that I call from Pig scripts. I believe the problem is that I assumed my input, deltas, arrives in a format it isn't actually in, but I'm not sure how to fix it (Python n00b).
Note: On Cloudera (cdh4.3) distro of Hadoop v.2.0.0, Pig v.0.11.0, Python 2.4.3.
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

@outputSchema("adj:float")
def cumRelFreqAdj(deltas):
    # create bins of increment 0.01
    a = [i*-0.01 for i in range(100)]
    a = a[1:len(a)]
    b = [i*0.01 for i in range(101)]
    a.extend(b)
    a.sort()
    bins = a
    # build cumulative relative frequency distribution
    cumfreq = [0]*200
    for delta in deltas:
        for bin in range(len(bins)):
            if delta <= bins[bin]:
                cumfreq[bin] += 1
    cumrelfreq = [float(cumfreq[i]) / max(cumfreq) for i in range(len(cumfreq))]
    crf = zip(bins, cumrelfreq)
    for relfreq in crf[:]:
        if relfreq[1] > 0.11: # 10%ile
            adj = relfreq[0] + 0.05
            break
    return adj
Do I need to convert my input to a list first?
Answered my own question. The input from Pig is a bag of tuples. In my case each tuple has one element, e.g.: {(-0.01), (-0.03), (0.00001), (-0.2383), (0.158)}.
So in order to compare each value to a float from the other list, bins, I need to insert something like:
delta = list(delta)[0]
as the first statement inside the for delta in deltas: loop, before the inner for bin in range(len(bins)): loop, to pull out the float that is the tuple's only element. Then the comparison if delta <= bins[bin]: will work.
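For anyone who wants to try the fix outside Pig, here is a minimal sketch in plain Python. It assumes the bag arrives as an iterable of one-element tuples, modelled here as a list of tuples. The names unwrap_bag and cum_rel_freq_adj are made up for this example, the @outputSchema decorator is left out so it runs without Pig, and the bins are built with i / 100.0, so the edges may differ from the original ones in the last floating-point digits.

```python
def unwrap_bag(bag):
    """Pull the single field out of each one-element tuple in a Pig-style bag."""
    return [t[0] for t in bag]


def cum_rel_freq_adj(deltas):
    # Bin edges from -0.99 to 1.00 in steps of 0.01 (200 edges in total).
    bins = [i / 100.0 for i in range(-99, 101)]

    # The fix: work with the floats inside the tuples, not the tuples themselves.
    values = unwrap_bag(deltas)
    if not values:
        return None  # assumption: an empty bag has no adjustment

    # cumfreq[k] counts the values less than or equal to bins[k].
    cumfreq = [sum(1 for v in values if v <= edge) for edge in bins]
    total = max(cumfreq)
    if total == 0:
        return None  # assumption: every value lies above the last bin edge

    # Return the first bin edge whose cumulative relative frequency
    # exceeds the 0.11 threshold, shifted by 0.05 as in the original UDF.
    for edge, count in zip(bins, cumfreq):
        if float(count) / total > 0.11:
            return edge + 0.05


# Sample bag from the answer, written as Python tuples.
sample_bag = [(-0.01,), (-0.03,), (0.00001,), (-0.2383,), (0.158,)]
adjustment = cum_rel_freq_adj(sample_bag)
```

Indexing with t[0] does the same job as list(delta)[0] for a one-element tuple. The early returns for an empty bag and for a zero total are additions of this sketch; the original function would raise an error in those cases.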