Can't access non-public directories on local FS in streamsets pipeline creator


I'm new to StreamSets. While following the documentation tutorial, I was getting a

FileNotFound: ... HADOOPFS_14 ... (permission denied)

error when I tried to set the destination to a local FS directory and preview the pipeline (basically saying that the file either can't be accessed or does not exist), yet the permissions on the directory in question are drwxrwxr-x. 2 mapr mapr. I eventually found a workaround: making the destination folder writable by others (chmod o+w /path/to/dir). Yet the user that started the sdc service while I was following the installation instructions was root, so it should have had write permission on that directory.

I set the sdc user environment variables to "mapr" (the owner of the directories I'm trying to access), so why was I rejected? What actually happens when I set the environment variables for sdc? They do not seem to have any effect.

This is a snippet of what my /opt/streamsets-datacollector/libexec/sdcd-env.sh file looks like:

# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr

# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr

So my question is: what determines the permissions of the sdc service (which I assume is what the StreamSets web UI uses to access FS locations)? Any explanation or links to specific documentation would be appreciated. Thanks.
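
For context on the question above: on Linux, a file access is checked against the user and groups of the process making it, and the mode string decides which permission triad (owner, group, or other) applies. Below is a minimal sketch of that selection, assuming the sdc user belongs only to a group named sdc (an assumption, not something stated in the post). applicable_triad and can_write are hypothetical helpers, and the sketch deliberately ignores root, ACLs, and NFS squashing.

```shell
#!/bin/sh
# Sketch only: choose the permission triad of an `ls -l` mode string
# that applies to a given user. Ignores root, ACLs and NFS squashing.

# usage: applicable_triad MODE OWNER GROUP USER "USER_GROUPS"
#   MODE         mode string as printed by ls -l, e.g. drwxrwxr-x.
#   USER_GROUPS  space-separated groups the user belongs to
applicable_triad() {
    mode=$1; owner=$2; group=$3; user=$4; user_groups=$5
    # The owner triad (characters 2-4) applies if the user owns the entry.
    if [ "$user" = "$owner" ]; then
        printf '%s\n' "$mode" | cut -c2-4
        return
    fi
    # Otherwise the group triad (characters 5-7) applies if the user
    # is a member of the owning group.
    for g in $user_groups; do
        if [ "$g" = "$group" ]; then
            printf '%s\n' "$mode" | cut -c5-7
            return
        fi
    done
    # Everyone else gets the "other" triad (characters 8-10).
    printf '%s\n' "$mode" | cut -c8-10
}

# usage: can_write MODE OWNER GROUP USER "USER_GROUPS"
# Succeeds if the applicable triad contains the write bit.
can_write() {
    case "$(applicable_triad "$@")" in
        ?w?) return 0 ;;
        *)   return 1 ;;
    esac
}

# Directory from the question: drwxrwxr-x. owned by mapr:mapr
applicable_triad 'drwxrwxr-x.' mapr mapr sdc 'sdc'    # prints r-x
applicable_triad 'drwxrwxr-x.' mapr mapr mapr 'mapr'  # prints rwx
```

Under that assumption, a process running as sdc falls into the "other" triad, which has no write bit here. That would be consistent with chmod o+w making the error go away.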


Solution

  • Running ps -ef | grep sdc to see which user the system really considers the owner of the sdc process showed:

    sdc    36438  36216  2 09:04 ?    00:01:28 /usr/bin/java -classpath /opt/streamsets-datacollector
    

    So editing sdcd-env.sh did not have any effect: the process was still running as sdc, not mapr. What did work was editing the /usr/lib/systemd/system/sdc.service file to look like this (note that User and Group are set to the user and group that own the directories used in the StreamSets pipeline):

    [Unit]
    Description=StreamSets Data Collector (SDC)
    
    [Service]
    User=mapr
    Group=mapr
    LimitNOFILE=32768
    Environment=SDC_CONF=/etc/sdc
    Environment=SDC_HOME=/opt/streamsets-datacollector
    Environment=SDC_LOG=/var/log/sdc
    Environment=SDC_DATA=/var/lib/sdc
    ExecStart=/opt/streamsets-datacollector/bin/streamsets dc -verbose
    TimeoutSec=60
    

    After that, restarting the sdc service (with systemctl start sdc, on CentOS 7) and checking ps again showed:

    mapr    157013 156955 83 10:38 ?    00:01:08 /usr/bin/java -classpath /opt/streamsets-datacollector...
    

    I was then able to validate and run pipelines whose origins and destinations are local FS directories owned by the user and group set in the sdc.service file.

    * NOTE: The directories in the original question are MapR (Hadoop) directories mounted via NFS (MapR 6.0) on nodes running CentOS 7. Because they are accessed as ordinary NFS mounts, this solution should apply to local FS paths generally.
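
As a possible alternative to editing the packaged unit file under /usr/lib (which a package upgrade may overwrite), the same User and Group settings can go in a systemd drop-in. This is a sketch, not something tested in the original answer: write_sdc_dropin is a hypothetical helper, the file name override.conf is just the conventional choice, and the mapr user and group are the ones from the post.

```shell
#!/bin/sh
# Sketch only: write a systemd drop-in that overrides User/Group for a unit.
# usage: write_sdc_dropin DIR USER GROUP
#   DIR would normally be /etc/systemd/system/sdc.service.d (needs root).
write_sdc_dropin() {
    dir=$1; user=$2; group=$3
    mkdir -p "$dir" || return 1
    # Settings in a drop-in take precedence over the packaged unit file.
    cat > "$dir/override.conf" <<EOF
[Service]
User=$user
Group=$group
EOF
}

# Typical use on the host, run as root (left commented out here):
#   write_sdc_dropin /etc/systemd/system/sdc.service.d mapr mapr
#   systemctl daemon-reload    # make systemd re-read its unit files
#   systemctl restart sdc
#   # confirm which user the main service process now runs as:
#   ps -o user= -p "$(systemctl show -p MainPID sdc | cut -d= -f2)"
```

Whichever file is edited, systemd only picks up the change after systemctl daemon-reload, and the ps check confirms the result the same way as in the answer above.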