hadoop hdfs hdp

Do users need to exist on all nodes to be recognized by the Hadoop cluster / HDFS?


In MapR Hadoop, for a user to access HDFS or run programs on YARN, the user had to exist on every node in the cluster with the same uid and gid. This included client nodes that act as neither data nodes nor control nodes (MapR does not really have the concept of namenodes). Is the same true for Hortonworks HDP?


Solution

  • Found this answer on the Hortonworks community site:

    User should not have account on all the nodes of the cluster. He should only have account on edge node.

    For a new user, there are 2 types of directories we need to create before the user accesses the cluster.

    1- User home directory [directory created on Linux Filesystem ie. /home/]

    2- User HDFS directory [directory created on HDFS filesystem ie. /user/]

    ...you only need to create HDFS home directory [ie. /user/] on edge node [it is unclear what this means, since an HDFS directory does not seem to be tied to any particular edge node]. You can still run jobs with the new user on cluster, even if you haven't created his home directory in linux.

    ** Update: Based on comments by user @cricket_007, it appears that the user must also exist on the NameNode server. The closest thing I could find to documentation stating this explicitly says:

    Each file or directory operation passes the full path name to the NameNode, and the permissions checks are applied along the path for each operation. The client framework will implicitly associate the user identity with the connection to the NameNode, reducing the need for changes to the existing client API. [...] For instance, when the client first begins reading a file, it makes a first request to the NameNode to discover the location of the first blocks of the file.
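The steps described in the answer can be sketched as a short Python helper that only builds the relevant commands and does not run them. This is a minimal sketch under stated assumptions: the user name `alice`, the HDFS superuser account `hdfs`, and the `/user/<name>` home directory layout are illustrative and may differ on a given cluster. The commands themselves (`hdfs dfs -mkdir -p`, `hdfs dfs -chown`, and `hdfs groups`) are standard Hadoop CLI calls. The last one asks the NameNode which groups it resolves for the user, which is one way to check whether the NameNode host knows the account.

```python
import shlex


def provisioning_commands(user, group=None, hdfs_superuser="hdfs"):
    """Build argv lists for creating and checking a user's HDFS home directory.

    Assumptions (not taken from the original answer): the home directory
    lives at /user/<user>, and the HDFS superuser account is named "hdfs".
    """
    group = group or user          # default to a group named after the user
    home = f"/user/{user}"         # assumed HDFS home directory layout
    as_superuser = ["sudo", "-u", hdfs_superuser]
    return [
        # Create the HDFS home directory (no error if it already exists).
        as_superuser + ["hdfs", "dfs", "-mkdir", "-p", home],
        # Hand ownership of the directory to the new user.
        as_superuser + ["hdfs", "dfs", "-chown", f"{user}:{group}", home],
        # Ask the NameNode which groups it resolves for this user.
        ["hdfs", "groups", user],
    ]


if __name__ == "__main__":
    # "alice" is a hypothetical user name used only for illustration.
    for argv in provisioning_commands("alice"):
        print(shlex.join(argv))
```

Running the script prints the three commands so they can be reviewed before being executed on an edge node. If `hdfs groups` returns no groups for the user, the account is probably not resolvable on the NameNode host, which is consistent with the update above.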