
Streamsets Mapr FS origin/dest. KerberosPrincipal exception (using hadoop impersonation in mapr 6.0)


I am trying to do a simple data move from a mapr fs origin to a mapr fs destination (this is not my use case, just doing this simple movement for testing purposes). When trying to validate this pipeline, the error message I see in the staging area is:

HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal

Trying different variations of the hadoop fs URI field (e.g. mfs:///mapr/mycluster.cluster.local, maprfs:///mycluster.cluster.local) does not seem to help. Looking at the logs after trying to validate, I see

2018-01-04 10:28:56,686     mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3  INFO    Created source of type: com.streamsets.pipeline.stage.origin.maprfs.ClusterMapRFSSource@16978460    DClusterSourceOffsetCommitter   *admin      preview-pool-1-thread-3

2018-01-04 10:28:56,697     mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3  INFO    Error connecting to FileSystem: java.io.IOException: Provided Subject must contain a KerberosPrincipal  ClusterHdfsSource   *admin      preview-pool-1-thread-3

java.io.IOException: Provided Subject must contain a KerberosPrincipal
....

2018-01-04 10:20:39,159     mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3   INFO    Authentication Config:  ClusterHdfsSource   *admin      preview-pool-1-thread-3

2018-01-04 10:20:39,159     mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3   ERROR   Issues: Issue[instance='MapRFS_01' service='null' group='HADOOP_FS' config='null' message='HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal'']    ClusterHdfsSource   *admin      preview-pool-1-thread-3

2018-01-04 10:20:39,169     mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3   INFO    Validation Error: Failed to configure or connect to the 'maprfs:///mapr/mycluster.cluster.local' Hadoop file system: java.io.IOException: Provided Subject must contain a KerberosPrincipal     HdfsTargetConfigBean    *admin  0   preview-pool-1-thread-3

java.io.IOException: Provided Subject must contain a KerberosPrincipal
....

However, to my knowledge, the system is not running Kerberos, so this error message is a bit confusing for me. Uncommenting #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}" in the sdc environment variable file to enable native mapr authentication did not seem to help (even when reinstalling and commenting out this line before running the streamsets mapr setup script).

Does anyone have any idea what is happening and how to fix it? Thanks.


Solution

  • This answer was provided on the mapr community forums and worked for me (using mapr v6.0). Note that the instructions here differ from those currently provided by the streamsets documentation. Throughout these instructions, I was logged in as user root.

    After installing streamsets (and the mapr prerequisites) as per the documentation...

    1. Change the owner of the streamsets $SDC_DIST or $SDC_HOME location to the mapr user (or whatever other user you plan to use for the hadoop impersonation): $chown -R mapr:mapr $SDC_DIST (for me this was the /opt/streamsets-datacollector dir.). Do the same for $SDC_CONF (/etc/sdc for me) as well as /var/lib/sdc and /var/log/sdc, as consolidated below.
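
    A consolidated sketch of the above, assuming the default install paths (adjust for your system):

      $chown -R mapr:mapr $SDC_DIST            # /opt/streamsets-datacollector for me
      $chown -R mapr:mapr $SDC_CONF            # /etc/sdc for me
      $chown -R mapr:mapr /var/lib/sdc /var/log/sdc
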
    2. In $SDC_DIST/libexec/sdcd-env.sh, set the user and group name (near the top of the file) to the mapr user "mapr" and enable mapr password login. The file should end up looking like:

      # user that will run the data collector, it must exist in the system
      #
      export SDC_USER=mapr
      
      # group of the user that will run the data collector, it must exist in the system
      #
      export SDC_GROUP=mapr
      ....
      # Indicate that MapR Username/Password security is enabled
      export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
      
    3. Edit the file /usr/lib/systemd/system/sdc.service to look like:

      [Service] 
      User=mapr 
      Group=mapr
      
    4. $cd into /etc/systemd/system/ and create a directory called sdc.service.d. Within that directory, create a file ending in .conf (the base name can be anything; systemd only reads drop-in files with a .conf extension) and add the contents (no spaces around the =):

      [Service]
      Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
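
    One way to create it from the shell (a sketch; the file name override.conf is my own arbitrary choice):

      $mkdir -p /etc/systemd/system/sdc.service.d
      $printf '[Service]\nEnvironment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true\n' > /etc/systemd/system/sdc.service.d/override.conf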
      
    5. If you are using mapr's sasl ticket auth. system (or something similar), generate a ticket for this user on the node that is running streamsets. In this case, with the $maprlogin password command.
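
    For example (a sketch; run on the streamsets node as the user the sdc process will run as — $maprlogin print is just an extra check that the ticket was created):

      $su - mapr
      $maprlogin password
      $maprlogin print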

    6. Then finally, reload systemd's configuration and restart the sdc service: $systemctl daemon-reload, then $systemctl restart sdc.

    Run something like $ps aux | grep sdc | grep maprlogin to check that the sdc process is owned by mapr and that the -Dmaprlogin.password.enabled=true parameter has been successfully set (see the combined sequence below). Once this is done, you should be able to validate/run maprFS to maprFS operations in the streamsets pipeline builder in batch processing mode.
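
    Put together, the restart-and-verify sequence looks like ($systemctl status sdc is an extra sanity check, not part of the original instructions):

      $systemctl daemon-reload
      $systemctl restart sdc
      $systemctl status sdc
      $ps aux | grep sdc | grep maprlogin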

    ** NOTE: If using the Hadoop Configuration Directory param. instead of Hadoop FS URI, remember to have the files from your $HADOOP_HOME/conf directory (e.g. hadoop-site.xml, yarn-site.xml, etc.; in the case of mapr, something like /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/) either soft-linked or hard-copied into a directory under $SDC_DIST/resources/ (a hadoop config dir. you may need to create; I just copy everything in the directory), and add this path to the Hadoop Configuration Directory param. for your MaprFS (or HadoopFS) stage. In the sdc web UI, the Hadoop Configuration Directory box takes just the name of that directory within $SDC_DIST/resources/, as sketched below.
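
    For example (a sketch; the subdirectory name hadoop-conf is my own choice, and <version> depends on your mapr install):

      $mkdir -p $SDC_DIST/resources/hadoop-conf
      $cp -r /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/* $SDC_DIST/resources/hadoop-conf/

    Then enter hadoop-conf in the stage's Hadoop Configuration Directory box.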

    ** NOTE: If you are still seeing errors of the form

     2018-01-16 14:26:10,883    ingest2sa_demodata_batch/ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41    ERROR   Error in Slave Runner:  ClusterRunner   *admin      runner-pool-2-thread-29
     com.streamsets.datacollector.runner.PipelineRuntimeException: CONTAINER_0800 - Pipeline 'ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41' validation error : HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal'


    you may also need to add -Dmaprlogin.password.enabled=true to the pipeline's Cluster > Worker Java Options tab for the origin and destination hadoop FS stages, as shown below.
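
    In that box, the option is entered verbatim:

      -Dmaprlogin.password.enabled=true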

    ** The video linked in the mapr community post also says to generate a mapr ticket for the sdc user (the default user that the sdc process runs as when running as a service), but I did not do this and the solution still worked for me (so if anyone has any idea why it should be done regardless, please let me know in the comments).