Tags: hadoop, hive, pyhive

PyHive ignoring Hive config


I'm intermittently getting the error message

DAG did not succeed due to VERTEX_FAILURE.

when running Hive queries via PyHive. Hive runs on an EMR cluster where hive.vectorized.execution.enabled is set to false in hive-site.xml specifically to avoid this error.

I can set the above property through the configuration on the Hive connection, and my query has run successfully every time since. However, I want to confirm that this is what actually fixed the issue, and that hive-site.xml really is being ignored.
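
For context, this is roughly how I'm passing the property on the connection (host and credentials are placeholders):

    from pyhive import hive

    # Session-level overrides are passed when the HiveServer2 session is opened;
    # host/port/username are placeholders for the real EMR master node.
    conn = hive.connect(
        host="emr-master.example.com",  # placeholder hostname
        port=10000,
        username="hadoop",
        configuration={"hive.vectorized.execution.enabled": "false"},
    )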

Can anyone confirm whether this is the expected behaviour? Alternatively, is there any way to inspect the Hive configuration via PyHive? I haven't been able to find one.

Thanks!


Solution

  • PyHive is a thin client that connects to HiveServer2, just like a Java or C client (via JDBC or ODBC). It does not use any Hadoop configuration files on your local machine. The HS2 session starts with whatever properties are set server-side.
    Same goes for ImPyla BTW.

    So it's your responsibility to set custom session properties from your Python code, e.g. execute this statement...
    SET hive.vectorized.execution.enabled=false
    ... before running your SELECT (see the sketch below).
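
A minimal sketch of that pattern, assuming a HiveServer2 endpoint at emr-master.example.com:10000 and a table called some_table (both placeholders). Executing a bare SET <property> returns the current value as a result row, which also gives you a way to inspect the effective session configuration from PyHive:

    from pyhive import hive

    # Placeholder connection details for the EMR master node.
    conn = hive.connect(host="emr-master.example.com", port=10000, username="hadoop")
    cursor = conn.cursor()

    # Apply the session-level override before the actual query.
    cursor.execute("SET hive.vectorized.execution.enabled=false")

    # Sanity check: SET with no value returns the effective setting as a
    # one-row result, e.g. [('hive.vectorized.execution.enabled=false',)]
    cursor.execute("SET hive.vectorized.execution.enabled")
    print(cursor.fetchall())

    # Now run the query that was hitting VERTEX_FAILURE.
    cursor.execute("SELECT * FROM some_table LIMIT 10")  # hypothetical table
    for row in cursor.fetchall():
        print(row)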