hadoop, hive, elastic-map-reduce

Hive - map not sending parameters to custom map script?


I'm trying to use the MAP clause with Hive, but I'm tripping over the syntax and can't find many examples of my use case. I have used the MAP clause before, when I had to process one of the columns of a table with an external script.

I had a Python script called, say, run, which took one command-line parameter and spat out three space-separated values. So I just did:

FROM
    (MAP
        tablename.columnName
     USING
        'run' AS result1, result2, result3
     FROM
        tablename
    ) map_output
INSERT OVERWRITE TABLE results SELECT *;

Now I have a Python script that receives many more parameters. I tried a few things that didn't work and couldn't find any examples covering this case. I did the obvious thing:

FROM
    (MAP
        numAgents, alpha, beta, burnin, nsteps, thin
     USING
        'runAuthorityMCMC' AS numAgents, alpha, beta, energy, avgDegree, maxDegree, accept
     FROM
        parameters
    ) map_output
INSERT OVERWRITE TABLE results SELECT *;

But I got the error "A user-supplied transfrom script has exited with error code 2 instead of 0" (sic). When I run runAuthorityMCMC by hand with six command-line parameters sampled from that table, it works perfectly well.

It seems to me that Hive is trying to run the script without passing the parameters at all. In one of the error messages I got exactly the output I would expect if that were the case. What is the correct syntax to do what I'm trying to do?

EDIT:

Confirming: this was part of the error message:

usage: runAuthorityMCMC [-h]
                        numAgents normalizedBrainCapacity ecologicalPressure
                        burnInSteps monteCarloSteps thiningRatio
runAuthorityMCMC: error: too few arguments

That is exactly the output I'd expect when the script is run with too few arguments; it should take six.


Solution

  • OK, perhaps there is a difference of vocabulary here, but Hive doesn't send the values to the script as "arguments". The script reads them from standard input (which is different from receiving them as command-line arguments). If my memory serves me right, the values are sent tab-separated, and the rows the script emits are also expected to be tab-separated (see the sketch below). You can also try sending the data to /bin/cat to see exactly what Hive passes to your script.

    Try printing debug output to stderr in your script; you will see it in your jobtracker logs, and that will help you debug. (Avoid stdout for debugging: whatever the script writes to stdout is taken by Hive as its output rows.)
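
    For illustration, here is a minimal sketch of such a transform script, assuming the six-column layout from the question; the script name is taken from the question, and the actual simulation is replaced by hypothetical placeholder outputs:

        #!/usr/bin/env python
        # Sketch of a Hive transform script (placeholder logic, not the real MCMC).
        import sys

        for line in sys.stdin:
            # Hive streams one tab-separated row per line on stdin;
            # nothing arrives in sys.argv.
            numAgents, alpha, beta, burnin, nsteps, thin = \
                line.rstrip('\n').split('\t')

            # ... run the real simulation with these values; placeholders here ...
            energy, avgDegree, maxDegree, accept = '0.0', '0.0', '0', '0.0'

            # Emit one tab-separated output row per input row on stdout.
            print('\t'.join([numAgents, alpha, beta, energy,
                             avgDegree, maxDegree, accept]))

    And the /bin/cat check mentioned above might look like this (a debugging sketch reusing the parameters table from the question; it just echoes each input row back so you can see what the script receives):

        FROM
            (MAP
                numAgents, alpha, beta, burnin, nsteps, thin
             USING
                '/bin/cat' AS numAgents, alpha, beta, burnin, nsteps, thin
             FROM
                parameters
            ) map_output
        SELECT *;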