pyspark, google-cloud-platform, google-cloud-dataproc, gsutil, google-cloud-datalab

Dataproc PySpark Workers Have no Permission to Use gsutil


Under Dataproc I set up a PySpark cluster with 1 master node and 2 workers. In a bucket I have directories containing sub-directories of files.

In the Datalab notebook I run

import subprocess
all_parent_directories = subprocess.Popen("gsutil ls gs://parent-directories", shell=True, stdout=subprocess.PIPE).stdout.read()

This gives me all the sub-directories with no problem.

Then I want to gsutil ls all the files in the sub-directories, so on the master node I wrote:

def get_sub_dir(path):
    import subprocess
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()

and running get_sub_dir(sub-directory) gives all the files with no problem.

However,

 sub_dir = sc.parallelize([sub-directory])
 sub_dir.map(get_sub_dir).collect()

gives me:

 Traceback (most recent call last):
  File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module>
    main()
  File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main
    project, account = bootstrapping.GetActiveProjectAndAccount()
  File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount
    project_name = properties.VALUES.core.project.Get(validate=False)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get
    required)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty
    value = _GetPropertyWithoutDefault(prop, properties_file)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault
    value = callback()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject
    return c_gce.Metadata().Project()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata
    _metadata_lock.lock(function=_CreateMetadata, argument=None)
  File "/usr/lib/python2.7/mutex.py", line 44, in lock
    function(argument)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata
    _metadata = _GCEMetadata()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__
    self.connected = gce_cache.GetOnGCE()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE
    return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE
    self._WriteDisk(on_gce)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk
    with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file:
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate
    MakeDir(full_parent_dir_path, mode=0700)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir
    (u'Please verify that you have permissions to write to the parent '
googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied.

Please verify that you have permissions to write to the parent directory.

After checking with whoami on the worker nodes, it shows the user is yarn.

So the question is: how do I authorize yarn to use gsutil, or is there any other way to access the bucket from the Dataproc PySpark worker nodes?


Solution

  • The CLI looks at the current homedir for a location to place a cached credential file when it fetches a token from the metadata service. The relevant code in googlecloudsdk/core/config.py looks like this:

    def _GetGlobalConfigDir():
      """Returns the path to the user's global config area.
    
      Returns:
        str: The path to the user's global config area.
      """
      # Name of the directory that roots a cloud SDK workspace.
      global_config_dir = encoding.GetEncodedValue(os.environ, CLOUDSDK_CONFIG)
      if global_config_dir:
        return global_config_dir
      if platforms.OperatingSystem.Current() != platforms.OperatingSystem.WINDOWS:
        return os.path.join(os.path.expanduser('~'), '.config',
                            _CLOUDSDK_GLOBAL_CONFIG_DIR_NAME)
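
    As an aside (illustration only, not part of the fix below), here is a tiny sketch that mirrors the lookup order of the quoted _GetGlobalConfigDir(): the CLOUDSDK_CONFIG environment variable wins, otherwise the SDK falls back to ~/.config/gcloud, which is exactly the /home/.config/gcloud path in the traceback above.

        # Mirrors the lookup order of _GetGlobalConfigDir() quoted above;
        # 'gcloud' stands in for _CLOUDSDK_GLOBAL_CONFIG_DIR_NAME, matching the
        # /home/.config/gcloud path from the traceback.
        import os

        config_dir = os.environ.get('CLOUDSDK_CONFIG') or os.path.join(
            os.path.expanduser('~'), '.config', 'gcloud')
        print(config_dir)  # inside a YARN container this ends up under /home/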
    

    For things running in YARN containers, even though they run as user yarn (if you run sudo su yarn on a Dataproc node you'll see ~ resolve to /var/lib/hadoop-yarn), YARN actually propagates yarn.nodemanager.user-home-dir as the container's homedir, and this defaults to /home/. For this reason, even though you can sudo -u yarn gsutil ..., it doesn't behave the same way as gsutil in a YARN container, and naturally, only root is able to create directories in the base /home/ directory.
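
    A quick way to see this from inside a Spark task is to have an executor report what HOME and ~ resolve to in its YARN container (a minimal sketch, assuming the sc SparkContext from the question):

        def check_home(_):
            # Runs on an executor inside a YARN container.
            import os
            return os.environ.get('HOME'), os.path.expanduser('~')

        print(sc.parallelize([0]).map(check_home).collect())
        # On a default Dataproc cluster this will likely report /home/ rather than
        # the yarn user's /var/lib/hadoop-yarn shell homedir.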

    Long story short, you have two options:

    1. In your code, add HOME=/var/lib/hadoop-yarn right before your gsutil statement (a fuller end-to-end sketch of this approach appears at the end of this answer).

    Example:

       p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path, shell=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    
    2. When creating the cluster, specify the YARN property.

    Example:

    gcloud dataproc clusters create --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn ...
    

    For an existing cluster, you could also manually add the property to /etc/hadoop/conf/yarn-site.xml on all your workers and then reboot the worker machines (or just run sudo systemctl restart hadoop-yarn-nodemanager.service), but that can be a hassle to do manually on every worker node.
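
    Below is a minimal end-to-end sketch of option 1, reusing the question's setup (the sc SparkContext and the gs://parent-directories bucket are placeholders from the question). It passes the homedir through subprocess's env= argument instead of the HOME=... shell prefix, which is just a stylistic variation of the same fix.

        def get_sub_dir(path):
            import os
            import subprocess
            # Option 1 applied: point HOME at a directory the yarn user can write to,
            # so gsutil/gcloud can create its .config/gcloud credential cache there.
            env = dict(os.environ, HOME='/var/lib/hadoop-yarn')
            p = subprocess.Popen('gsutil ls gs://parent-directories/' + path,
                                 shell=True, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE, env=env)
            out, err = p.communicate()  # read both pipes without risking a deadlock
            return out, err

        sub_dir = sc.parallelize(['sub-directory'])  # placeholder from the question
        print(sub_dir.map(get_sub_dir).collect())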