hadoop · permissions · hive · hdfs · hadoop-yarn

What's the best solution for Hive proxy user in HDFS?


I'm very confused by the proxyuser settings in HDFS and Hive. I have the doAs option enabled in hive-site.xml

<property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
</property>

And proxyuser in core-site.xml

<property>
    <name>hadoop.proxyuser.hdfs.hosts</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.hdfs.groups</name>
    <value>*</value>
</property>

But this will cause:

2017-03-29 16:24:59,022 INFO org.apache.hadoop.ipc.Server: Connection from 172.16.0.239:60920 for protocol org.apache.hadoop.hdfs.protocol.ClientProtocol is unauthorized for user hive (auth:PROXY) via hive (auth:SIMPLE)
2017-03-29 16:24:59,023 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 9000: readAndProcess from client 172.16.0.239 threw exception [org.apache.hadoop.security.authorize.AuthorizationException: User: hive is not allowed to impersonate hive]

I didn't set the proxyuser to "hive" as most examples suggest, because core-site.xml is shared by other services and I don't want every service to access HDFS as hive. But I gave it a try anyway, so now core-site.xml looks like this:

<property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
</property>

I launched beeline again. This time the login succeeded, but when a command was running, YARN threw an exception:

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Permission denied: user=hive, access=WRITE, inode="/user/yarn/hive/.staging":hdfs:supergroup:drwxr-xr-x

The proxy user "hive" has been denied access to the staging folder, which is owned by "hdfs". I don't think giving 777 to the staging folder is a good idea, as it makes no sense to protect HDFS and then open the folder to everyone. So my question is: what's the best way to set up the permissions between Hive, HDFS and YARN?

Hadoop permissions are just a nightmare to me, please help.


Solution

  • Adding these proxyuser entries in core-site.xml allows the superuser named hive to connect from any host (since the value is *) and impersonate a user belonging to any group (since the value is *).

    <property>
        <name>hadoop.proxyuser.hive.hosts</name>
        <value>*</value>
    </property>
    
    <property>
        <name>hadoop.proxyuser.hive.groups</name>
        <value>*</value>
    </property>
    

    This can be made more restrictive by passing actual hostnames and group names (refer to the Hadoop Superusers documentation). The access privileges the superuser hive has on the filesystem will then apply when acting on behalf of the impersonated users.
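
    For example, a more restrictive configuration might look like the following sketch. The hostname `hiveserver2.example.com` and the group `hive-users` are placeholders, not values from the question; substitute the actual HiveServer2 host and the group of users who may be impersonated:

    <!-- core-site.xml: restrict impersonation to a specific host and group.
         The host and group names below are illustrative placeholders. -->
    <property>
        <name>hadoop.proxyuser.hive.hosts</name>
        <value>hiveserver2.example.com</value>
    </property>

    <property>
        <name>hadoop.proxyuser.hive.groups</name>
        <value>hive-users</value>
    </property>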

    For a multi-user Hadoop environment, the best practice is to create a dedicated directory for every superuser and configure the associated service to store its files there. Also create a group, supergroup, containing all these superusers, so that group-level access privileges can be granted on the files if required.
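
    As a sketch of that layout (the exact paths, user names, and group membership depend on your cluster; `/user/hive`, `/user/yarn`, and the membership commands below are assumptions, run as the HDFS superuser):

    # Create a dedicated HDFS home directory for each service superuser.
    # Paths and names are illustrative.
    hdfs dfs -mkdir -p /user/hive
    hdfs dfs -chown hive:supergroup /user/hive

    hdfs dfs -mkdir -p /user/yarn
    hdfs dfs -chown yarn:supergroup /user/yarn

    # Group-writable so members of supergroup can share files if required.
    hdfs dfs -chmod 775 /user/hive /user/yarn

    # Create the OS-level group and add the service users to it
    # (run on the NameNode host, where HDFS resolves group membership).
    groupadd supergroup
    usermod -aG supergroup hive
    usermod -aG supergroup yarn

    With this in place, each service writes under its own directory (e.g. Hive's staging files under /user/hive) instead of colliding in a folder owned by hdfs.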

    Add this property to hdfs-site.xml to configure the supergroup:

    <property>
       <name>dfs.permissions.superusergroup</name>
       <value>supergroup</value> 
       <description>The name of the group of super-users.</description>
    </property>