I'm trying to use the map clause with Hive but I'm tripping over syntax and not finding many examples of my use case around. I used the map clause before when I had to process one of the columns of a table using an external script.
I had a python script called, say, run
, that took one command line parameter and spit out three space separated values. So I just did:
FROM(MAP
tablename.columnName
USING
'run' AS
result1, result2, result3
FROM
tablename
) map_output
INSERT OVERWRITE TABLE results SELECT *;
Now I have a python script that receives a lot more parameters and tried a few things that didn't worked and couldn't find examples on this. I did the obvious thing:
FROM
(MAP
numAgents, alpha, beta, burnin, nsteps, thin
USING
'runAuthorityMCMC' AS numAgents, alpha, beta, energy, avgDegree, maxDegree, accept
FROM
parameters
) map_output
INSERT OVERWRITE TABLE results SELECT *;
But I got an error A user-supplied transfrom script has exited with error code 2 instead of 0.
When I run runAuthorityMCMC
, with 6 command line parameters sampled from that table, it works perfectly well.
It seems to me it's trying to run the script without passing the parameters at all. In one of the error messages I got exactly the output I expected if this was the case. What is the correct syntax to do what I'm trying to do?
EDIT:
Confirming - this was part of the error message:
usage: runAuthorityMCMC [-h]
numAgents normalizedBrainCapacity ecologicalPressure
burnInSteps monteCarloSteps thiningRatio
runAuthorityMCMC: error: too few arguments
Which is exactly the output I'd expect with too few arguments. The script should take six arguments.
Ok, perhaps there is a difference of vocabulary here but hive doesn't send the values as "arguments" to the script. They are read in through standard input (which is different than passing something as argument). Also, you can try sending the data to /bin/cat
so see what's actually being sent to the hive. If my memory serves me right, the values are sent tab separated and result emitted out from the script is also expected to be tab separated.
Trying printing stuff from stdout (or stderr) in your script, you will see the result in your jobtracker logs. That will help you debug.