hive, hadoop-yarn, pyodbc, google-cloud-dataproc, pyhive

How to make Dataproc detect Python-Hive connection as a Yarn Job?


I launch a Dataproc cluster and serve Hive on it. From a remote machine I use PyHive or PyODBC to connect to Hive and run queries. It's not just one query; it can be a long session with intermittent queries. (The queries themselves have issues; I'll ask about that separately.)
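
For reference, the connection itself is just a plain PyHive session, roughly like the sketch below (hostname, port, and username are placeholders, not my actual setup):

    from pyhive import hive

    # HiveServer2 runs on the Dataproc master node; 10000 is its default port.
    conn = hive.Connection(host="my-cluster-m", port=10000, username="me")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM some_table")
    print(cur.fetchall())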

Even while a single query is actively running, the operation does not show up as a "Job" (I guess that means a YARN job) on the dashboard. In contrast, when I "submit" tasks via PySpark, they show up as "Jobs".

Besides the lack of task visibility, I also suspect that, without a Job, the cluster may not reliably detect that a Python client is "connected" to it, and hence the cluster's auto-delete might kick in prematurely.

Is there a way to "register" a Job to accompany my Python session, and cancel/delete it at a time of my choosing? In my case it would be a "dummy", "nominal" job that does nothing.

Or maybe there's a more proper way to let YARN detect my Python client's connection and create a job for it?

Thanks.


Solution

  • This is not supported right now; you need to submit jobs via the Dataproc Jobs API to make them visible on the Jobs UI page and to have them taken into account by the cluster TTL feature.

    If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:

    # Dummy Pig job: runs a shell sleep for 5 hours so the cluster is not idle
    gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
        --execute="sh sleep $((5 * 60 * 60))"
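
    For completeness, here is a minimal sketch of submitting (and later cancelling) a job through the Jobs API from Python with the google-cloud-dataproc client library. The project, region, cluster name, and the Hive query are placeholders; the point is only that a job submitted this way appears on the Jobs UI page and counts toward the cluster's idle-time TTL:

    from google.cloud import dataproc_v1

    project_id = "my-project"      # placeholder
    region = "us-central1"         # placeholder
    cluster_name = "my-cluster"    # placeholder

    # Job submissions must go through the regional Dataproc endpoint.
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # A trivial Hive job submitted via the Jobs API.
    job = {
        "placement": {"cluster_name": cluster_name},
        "hive_job": {"query_list": {"queries": ["SHOW DATABASES;"]}},
    }
    submitted = job_client.submit_job(
        request={"project_id": project_id, "region": region, "job": job}
    )
    print("Submitted job:", submitted.reference.job_id)

    # The job can be cancelled later at a time of your choosing.
    job_client.cancel_job(
        request={
            "project_id": project_id,
            "region": region,
            "job_id": submitted.reference.job_id,
        }
    )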