Ambari unable to run custom hook for modifying user hive


I am trying to add a client node to a cluster via Ambari (v2.7.3.0, HDP 3.1.0.0-78) and am seeing an odd error:

stderr: 
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py", line 38, in <module>
    BeforeAnyHook().execute()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 352, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py", line 31, in hook
    setup_users()
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py", line 51, in setup_users
    fetch_nonlocal_groups = params.fetch_nonlocal_groups,
  File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
    self.env.run()
  File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/ambari-agent/lib/resource_management/core/providers/accounts.py", line 90, in action_create
    shell.checked_call(command, sudo=True)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 72, in inner
    result = function(command, **kwargs)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
    tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy, returns=returns)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 314, in _call
    raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of 'usermod -G hadoop -g hadoop hive' returned 6. usermod: user 'hive' does not exist in /etc/passwd
Error: Error: Unable to run the custom hook script ['/usr/bin/python', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-632.json', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-632.json', 'INFO', '/var/lib/ambari-agent/tmp', 'PROTOCOL_TLSv1_2', '']2019-11-25 13:07:58,000 - Reporting component version failed
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 363, in execute
    self.save_component_version_to_structured_out(self.command_name)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 223, in save_component_version_to_structured_out
    stack_select_package_name = stack_select.get_package_name()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 109, in get_package_name
    package = get_packages(PACKAGE_SCOPE_STACK_SELECT, service_name, component_name)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 223, in get_packages
    supported_packages = get_supported_packages()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 147, in get_supported_packages
    raise Fail("Unable to query for supported packages using {0}".format(stack_selector_path))
Fail: Unable to query for supported packages using /usr/bin/hdp-select



 stdout:
2019-11-25 13:07:57,644 - Stack Feature Version Info: Cluster Stack=3.1, Command Stack=None, Command Version=None -> 3.1
2019-11-25 13:07:57,651 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
2019-11-25 13:07:57,652 - Group['livy'] {}
2019-11-25 13:07:57,654 - Group['spark'] {}
2019-11-25 13:07:57,654 - Group['ranger'] {}
2019-11-25 13:07:57,654 - Group['hdfs'] {}
2019-11-25 13:07:57,654 - Group['zeppelin'] {}
2019-11-25 13:07:57,655 - Group['hadoop'] {}
2019-11-25 13:07:57,655 - Group['users'] {}
2019-11-25 13:07:57,656 - User['yarn-ats'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2019-11-25 13:07:57,658 - User['hive'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop'], 'uid': None}
2019-11-25 13:07:57,659 - Modifying user hive
Error: Error: Unable to run the custom hook script ['/usr/bin/python', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-632.json', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-632.json', 'INFO', '/var/lib/ambari-agent/tmp', 'PROTOCOL_TLSv1_2', '']
2019-11-25 13:07:57,971 - The repository with version 3.1.0.0-78 for this command has been marked as resolved. It will be used to report the version of the component which was installed
2019-11-25 13:07:58,000 - Reporting component version failed
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 363, in execute
    self.save_component_version_to_structured_out(self.command_name)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 223, in save_component_version_to_structured_out
    stack_select_package_name = stack_select.get_package_name()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 109, in get_package_name
    package = get_packages(PACKAGE_SCOPE_STACK_SELECT, service_name, component_name)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 223, in get_packages
    supported_packages = get_supported_packages()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 147, in get_supported_packages
    raise Fail("Unable to query for supported packages using {0}".format(stack_selector_path))
Fail: Unable to query for supported packages using /usr/bin/hdp-select

Command failed after 1 tries

The problem appears to be

resource_management.core.exceptions.ExecutionFailed: Execution of 'usermod -G hadoop -g hadoop hive' returned 6. usermod: user 'hive' does not exist in /etc/passwd

caused by

2019-11-25 13:07:57,659 - Modifying user hive
Error: Error: Unable to run the custom hook script ['/usr/bin/python', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-632.json', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-632.json', 'INFO', '/var/lib/ambari-agent/tmp', 'PROTOCOL_TLSv1_2', '']

This is further reinforced by the fact that manually adding the ambari-hdp-1.repo file and yum-installing hdp-select before adding the host to the cluster produces the same error messages, just truncated to the parts of stdout/stderr shown here.

When running

[root@HW001 .ssh]# /usr/bin/hdp-select versions
3.1.0.0-78

from the Ambari server node, I can see that the command runs.

Looking at what the hook script is trying to run/access, I see

[root@client001~]# ls -lha /var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py
-rw-r--r-- 1 root root 1.2K Nov 25 10:51 /var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py
[root@client001~]# ls -lha /var/lib/ambari-agent/data/command-632.json
-rw------- 1 root root 545K Nov 25 13:07 /var/lib/ambari-agent/data/command-632.json
[root@client001~]# ls -lha /var/lib/ambari-agent/cache/stack-hooks/before-ANY
total 0
drwxr-xr-x 4 root root  34 Nov 25 10:51 .
drwxr-xr-x 8 root root 147 Nov 25 10:51 ..
drwxr-xr-x 2 root root  34 Nov 25 10:51 files
drwxr-xr-x 2 root root 188 Nov 25 10:51 scripts
[root@client001~]# ls -lha /var/lib/ambari-agent/data/structured-out-632.json
ls: cannot access /var/lib/ambari-agent/data/structured-out-632.json: No such file or directory
[root@client001~]# ls -lha /var/lib/ambari-agent/tmp
total 96K
drwxrwxrwt  3 root root 4.0K Nov 25 13:06 .
drwxr-xr-x 10 root root  267 Nov 25 10:50 ..
drwxr-xr-x  6 root root 4.0K Nov 25 13:06 ambari_commons
-rwx------  1 root root 1.4K Nov 25 13:06 ambari-sudo.sh
-rwxr-xr-x  1 root root 1.6K Nov 25 13:06 create-python-wrap.sh
-rwxr-xr-x  1 root root 1.6K Nov 25 10:50 os_check_type1574715018.py
-rwxr-xr-x  1 root root 1.6K Nov 25 11:12 os_check_type1574716360.py
-rwxr-xr-x  1 root root 1.6K Nov 25 11:29 os_check_type1574717391.py
-rwxr-xr-x  1 root root 1.6K Nov 25 13:06 os_check_type1574723161.py
-rwxr-xr-x  1 root root  16K Nov 25 10:50 setupAgent1574715020.py
-rwxr-xr-x  1 root root  16K Nov 25 11:12 setupAgent1574716361.py
-rwxr-xr-x  1 root root  16K Nov 25 11:29 setupAgent1574717392.py
-rwxr-xr-x  1 root root  16K Nov 25 13:06 setupAgent1574723163.py

Notice the line ls: cannot access /var/lib/ambari-agent/data/structured-out-632.json: No such file or directory. I am not sure whether this is normal, though.

Does anyone know what could be causing this, or have any debugging hints from this point?


UPDATE 01: I added some log printing near the final line of the error trace, i.e. File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/stack_select.py", line 147, in get_supported_packages, to print the return code and stdout:

2
ambari-python-wrap: can't open file '/usr/bin/hdp-select': [Errno 2] No such file or directory

So what the heck? It wants hdp-select to already be there, but the Ambari add-host UI complains if I manually install that binary beforehand. When I do install it manually (using the same repo file as the rest of the existing cluster nodes), all I see is...

0
Packages:
  accumulo-client
  accumulo-gc
  accumulo-master
  accumulo-monitor
  accumulo-tablet
  accumulo-tracer
  atlas-client
  atlas-server
  beacon
  beacon-client
  beacon-server
  druid-broker
  druid-coordinator
  druid-historical
  druid-middlemanager
  druid-overlord
  druid-router
  druid-superset
  falcon-client
  falcon-server
  flume-server
  hadoop-client
  hadoop-hdfs-client
  hadoop-hdfs-datanode
  hadoop-hdfs-journalnode
  hadoop-hdfs-namenode
  hadoop-hdfs-nfs3
  hadoop-hdfs-portmap
  hadoop-hdfs-secondarynamenode
  hadoop-hdfs-zkfc
  hadoop-httpfs
  hadoop-mapreduce-client
  hadoop-mapreduce-historyserver
  hadoop-yarn-client
  hadoop-yarn-nodemanager
  hadoop-yarn-registrydns
  hadoop-yarn-resourcemanager
  hadoop-yarn-timelinereader
  hadoop-yarn-timelineserver
  hbase-client
  hbase-master
  hbase-regionserver
  hive-client
  hive-metastore
  hive-server2
  hive-server2-hive
  hive-server2-hive2
  hive-webhcat
  hive_warehouse_connector
  kafka-broker
  knox-server
  livy-client
  livy-server
  livy2-client
  livy2-server
  mahout-client
  oozie-client
  oozie-server
  phoenix-client
  phoenix-server
  pig-client
  ranger-admin
  ranger-kms
  ranger-tagsync
  ranger-usersync
  shc
  slider-client
  spark-atlas-connector
  spark-client
  spark-historyserver
  spark-schema-registry
  spark-thriftserver
  spark2-client
  spark2-historyserver
  spark2-thriftserver
  spark_llap
  sqoop-client
  sqoop-server
  storm-client
  storm-nimbus
  storm-slider-client
  storm-supervisor
  superset
  tez-client
  zeppelin-server
  zookeeper-client
  zookeeper-server
Aliases:
  accumulo-server
  all
  client
  hadoop-hdfs-server
  hadoop-mapreduce-server
  hadoop-yarn-server
  hive-server

Command failed after 1 tries

UPDATE 02: I added some custom logging in File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", just before line 322 (printing the values of err_msg, code, out, err), i.e.

....
    312   if throw_on_failure and not code in returns:
    313     err_msg = Logger.filter_text("Execution of '{0}' returned {1}. {2}".format(command_alias, code, all_output))
    314
    315     #TODO remove
    316     print("\n----------\nMY LOGS\n----------\n")
    317     print(err_msg)
    318     print(code)
    319     print(out)
    320     print(err)
    321
    322     raise ExecutionFailed(err_msg, code, out, err)
    323
    324   # if separate stderr is enabled (by default it's redirected to out)
    325   if stderr == subprocess32.PIPE:
    326     return code, out, err
    327
    328   return code, out
....

I see

Execution of 'usermod -G hadoop -g hadoop hive' returned 6. usermod: user 'hive' does not exist in /etc/passwd
6
usermod: user 'hive' does not exist in /etc/passwd

Error: Error: Unable to run the custom hook script ['/usr/bin/python', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-816.json', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-816.json', 'INFO', '/var/lib/ambari-agent/tmp', 'PROTOCOL_TLSv1_2', '']
2019-11-26 10:25:46,928 - The repository with version 3.1.0.0-78 for this command has been marked as resolved. It will be used to report the version of the component which was installed

So it seems to be failing to create the hive user (even though it apparently has no problem creating the yarn-ats user before that).


Solution

  • After just giving in and trying to manually create the hive user myself, I see

    [root@airflowetl ~]# useradd -g hadoop -s /bin/bash hive
    useradd: user 'hive' already exists
    [root@airflowetl ~]# cat /etc/passwd | grep hive
    <nothing>
    [root@airflowetl ~]# id hive
    uid=379022825(hive) gid=379000513(domain users) groups=379000513(domain users)
    

    The fact that this existing user's uid looks like this, and that the user is not in the /etc/passwd file, made me think that there is an existing Active Directory user (this client node syncs with AD via the installed SSSD) that already has the name hive. Checking our AD users, this turned out to be true.

    Temporarily stopping the SSSD service to stop the sync with AD (service sssd stop) before rerunning the client host add in Ambari fixed the problem for me. (I am not sure whether you can get a server to ignore AD syncs on a per-user basis.)
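
The mismatch found above (id resolves hive, but /etc/passwd has no such entry) can be checked before adding a host. The following is a minimal sketch, not part of Ambari: classify_user is a hypothetical helper name, and it assumes that a name which resolves through the system lookup (pwd.getpwnam, which goes through NSS and therefore SSSD) but is absent from the local passwd file comes from a directory service such as AD.

```python
import pwd


def classify_user(name, passwd_path="/etc/passwd", lookup=pwd.getpwnam):
    """Return 'local', 'non-local', or 'missing' for a user name.

    Hypothetical helper for illustration only.
    'local'     -> the name has an entry in the local passwd file
    'non-local' -> the name resolves via the system lookup (e.g. SSSD/AD)
                   but has no local entry, so usermod cannot modify it
    'missing'   -> the name does not resolve at all
    """
    # The first colon-separated field of each passwd line is the user name.
    with open(passwd_path) as f:
        local_names = {
            line.split(":", 1)[0]
            for line in f
            if line.strip() and not line.startswith("#")
        }
    if name in local_names:
        return "local"
    try:
        lookup(name)  # pwd.getpwnam raises KeyError for unknown names
    except KeyError:
        return "missing"
    return "non-local"


if __name__ == "__main__":
    # Service account names taken from the log output in the question.
    for user in ("yarn-ats", "hive"):
        print(user, classify_user(user))
```

On the failing client node described here, hive would be expected to come back as non-local, which matches the usermod error saying the user does not exist in /etc/passwd. Running the check for each service account before the Ambari host add would flag name clashes with directory users early.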