I'm new to StreamSets. While following the documentation tutorial, I got a
FileNotFound: ... HADOOPFS_14 ... (permission denied)
error when setting the destination to a local FS directory and previewing the pipeline. The error says the file either does not exist or cannot be accessed, yet the permissions on the directory in question are drwxrwxr-x. 2 mapr mapr.
I eventually found a workaround: making the destination directory world-writable (chmod o+w /path/to/dir). Still, the user that started the sdc service while I was following the installation instructions (root) should have had write permission on that directory.
I set the sdc user environment variables to "mapr" (the owner of the directories I'm trying to access), so why was I rejected? What actually happens when I set these environment variables for sdc? They do not seem to have any effect.
This is a snippet of my /opt/streamsets-datacollector/libexec/sdcd-env.sh file:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
So my question is: what determines the permissions of the sdc service (which I assume is what the StreamSets web UI uses to access FS locations)? Any explanation or links to specific documentation would be appreciated. Thanks.
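One way to reason about the rejection: for a regular (non-root) process, POSIX permission checks pick exactly one class (owner, group, or other) based on the user the process actually runs as, and only that class's bits are consulted. The sketch below models that choice for an ls-style mode string. The function names perm_class and can_write are hypothetical helpers, and the model is deliberately simplified (it ignores root, ACLs, and NFS ID mapping). In the usage example, the assumption that the sdc account belongs only to a group named sdc is mine, not something stated in the post.

```sh
#!/bin/sh
# perm_class USER USER_GROUPS OWNER GROUP
# Prints which permission class (owner, group, or other) applies to USER
# for a file owned by OWNER:GROUP. USER_GROUPS is a space-separated list.
# Simplified model: ignores root, ACLs, and NFS ID mapping.
perm_class() {
    user=$1; user_groups=$2; owner=$3; group=$4
    if [ "$user" = "$owner" ]; then
        echo owner
        return 0
    fi
    for g in $user_groups; do
        if [ "$g" = "$group" ]; then
            echo group
            return 0
        fi
    done
    echo other
}

# can_write MODE CLASS
# MODE is an ls-style string such as drwxrwxr-x; CLASS is owner, group, or other.
# Exits 0 when the write bit for that class is set, 1 when it is not.
can_write() {
    mode=$1; class=$2
    case $class in
        owner) bit=$(printf '%s' "$mode" | cut -c3) ;;
        group) bit=$(printf '%s' "$mode" | cut -c6) ;;
        other) bit=$(printf '%s' "$mode" | cut -c9) ;;
        *) return 2 ;;
    esac
    [ "$bit" = "w" ]
}

# Usage (assumes the process runs as sdc, a member of group sdc only):
#   class=$(perm_class sdc "sdc" mapr mapr)
#   can_write drwxrwxr-x "$class" || echo "write denied for class: $class"
```

Under those assumptions the class is "other", whose bits in drwxrwxr-x are r-x, which is consistent with chmod o+w making the error go away.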
Running ps -ef | grep sdc to see which user the system reports as the owner of the sdc process showed:
sdc 36438 36216 2 09:04 ? 00:01:28 /usr/bin/java -classpath /opt/streamsets-datacollector
So editing sdcd-env.sh had no effect: the process was still running as sdc. What did work was editing the /usr/lib/systemd/system/sdc.service file to look like this (note that User and Group are set to the user and group that own the directories used in the StreamSets pipeline):
[Unit]
Description=StreamSets Data Collector (SDC)
[Service]
User=mapr
Group=mapr
LimitNOFILE=32768
Environment=SDC_CONF=/etc/sdc
Environment=SDC_HOME=/opt/streamsets-datacollector
Environment=SDC_LOG=/var/log/sdc
Environment=SDC_DATA=/var/lib/sdc
ExecStart=/opt/streamsets-datacollector/bin/streamsets dc -verbose
TimeoutSec=60
Then restarting the sdc service (with systemctl start sdc on CentOS 7; systemd may also need systemctl daemon-reload to pick up the edited unit file) showed:
mapr 157013 156955 83 10:38 ? 00:01:08 /usr/bin/java -classpath /opt/streamsets-datacollector...
and I was able to validate and run pipelines whose origins and destinations are local FS directories owned by the user and group set in the sdc.service file.
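A possible refinement, offered as a sketch rather than the documented StreamSets procedure: files under /usr/lib/systemd/system belong to the package and can be replaced on upgrade, so the same User/Group change can live in a systemd drop-in instead. The helper name render_override is hypothetical. The unit name sdc.service comes from the post, and the drop-in path follows systemd's usual /etc/systemd/system/<unit>.d/ convention.

```sh
#!/bin/sh
# render_override USER GROUP
# Prints a systemd drop-in that overrides only the account a unit runs as.
render_override() {
    printf '[Service]\nUser=%s\nGroup=%s\n' "$1" "$2"
}

# Example usage as root (shown as comments so nothing runs on sourcing):
#   mkdir -p /etc/systemd/system/sdc.service.d
#   render_override mapr mapr > /etc/systemd/system/sdc.service.d/override.conf
#   systemctl daemon-reload    # make systemd re-read unit files
#   systemctl restart sdc
#   ps -ef | grep sdc          # first column should now show mapr
```

The drop-in leaves the packaged unit untouched, so the remaining settings (LimitNOFILE, the Environment lines, ExecStart) keep coming from the original file.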
* NOTE: The directories in the initial post are Hadoop MapR directories mounted via NFS (MapR 6.0) on nodes running CentOS 7. Since an NFS mount is accessed like any other local path, this solution should apply to local FS directories generally.