python, hive, boto, elastic-map-reduce

Python client support for running Hive on top of Amazon EMR


I've noticed that neither mrjob nor boto provides a Python interface for submitting and running Hive jobs on Amazon Elastic MapReduce (EMR). Are there any other Python client libraries that support running Hive on EMR?


Solution

  • With boto you can submit two script-runner steps, one that installs Hive and one that runs your Hive script, something like this:

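    from boto.emr.connection import EmrConnection
    from boto.emr.step import JarStep

    # Step 1 arguments: install Hive on the cluster
    args1 = [u's3://us-east-1.elasticmapreduce/libs/hive/hive-script',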
             u'--base-path',
             u's3://us-east-1.elasticmapreduce/libs/hive/',
             u'--install-hive',
             u'--hive-versions',
             u'0.7']
    # Step 2 arguments: run the Hive script stored at s3_query_file_uri
    args2 = [u's3://us-east-1.elasticmapreduce/libs/hive/hive-script',
             u'--base-path',
             u's3://us-east-1.elasticmapreduce/libs/hive/',
             u'--hive-versions',
             u'0.7',
             u'--run-hive-script',
             u'--args',
             u'-f',
             s3_query_file_uri]
    steps = []
    # Use step_name as the loop variable so it does not overwrite `name`,
    # which is passed to run_jobflow() below as the job flow name.
    for step_name, args in zip(('Setup Hive', 'Run Hive Script'), (args1, args2)):
        step = JarStep(step_name,
                       's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
                       step_args=args,
                       # action_on_failure="CANCEL_AND_WAIT"
                       )
        steps.append(step)
    # Kick off the job. name, s3_log_uri, s3_query_file_uri and the instance
    # settings are assumed to be defined by the caller.
    jobid = EmrConnection().run_jobflow(name, s3_log_uri,
                                        steps=steps,
                                        master_instance_type=master_instance_type,
                                        slave_instance_type=slave_instance_type,
                                        num_instances=num_instances,
                                        hadoop_version="0.20")
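
The two argument lists above differ only in their trailing flags, so they could be built by a small helper. The following is a minimal sketch, not part of boto: the function name `hive_step_args`, its defaults, and the example bucket `s3://my-bucket/...` are assumptions made for illustration.

```python
# Hypothetical helper (not a boto API): builds the script-runner argument
# lists for the "install Hive" and "run Hive script" steps shown above.

HIVE_BASE_PATH = 's3://us-east-1.elasticmapreduce/libs/hive/'


def hive_step_args(query_file_uri, hive_version='0.7', base_path=HIVE_BASE_PATH):
    """Return (install_args, run_args) for the two script-runner steps."""
    # Both steps start with the hive-script location and its base path.
    common = [base_path + 'hive-script', '--base-path', base_path]
    install_args = common + ['--install-hive', '--hive-versions', hive_version]
    run_args = common + ['--hive-versions', hive_version,
                         '--run-hive-script', '--args', '-f', query_file_uri]
    return install_args, run_args


# Example with a made-up query location.
install_args, run_args = hive_step_args('s3://my-bucket/queries/report.q')
```

Each returned list can then be passed as `step_args` to a `JarStep`, exactly as `args1` and `args2` are in the answer's loop.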