Search code examples
apache-sparkspark-streamingamazon-emrapache-nifi

NiFi Streaming to Spark on EMR


Using blog posts on Apache and Hortonworks I've been able to stream from NiFi to Spark when both are located on the same machine. Now I'm trying to stream from NiFi on one EC2 instance to an EMR cluster in the same subnet and security group and I'm running into problems. The specific error being reported by the EMR Core machine is

Failed to receive data from NiFi
java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:454)
    at sun.nio.ch.Net.connect(Net.java:446)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
    at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
    at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:708)
    at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:682)
    at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:300)
    at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:129)
    at org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)

Using netstat on the core machine I see it does have an open TCP connection to the NiFi box on the site-to-site port (in my case 8090). On the NiFi machine, in the nifi-app.log file, I see logs from the "Site-to-Site Worker Thread" about my core machine making connection (and nothing about any errors). So the initial connection seems to be successful but not much after that.

When I ran my Spark code locally I was on the NiFi EC2 instance, so I know that in general it works. I'm just hitting something, probably security related, once the client is an EMR cluster.

As a work around I can post a file to S3 and then launch a Spark step from NiFi (using a Python script), but I'd much rather stream the data (and using Kafka isn't an option). Has anyone else gotten streaming from NiFi to EMR working?

This post is similar: Getting data from Nifi to spark streaming the difference being I have security turned off and I'm using http, not https (and I'm getting connection refused as opposed to a 401).

Edit:

nifi.properties:

# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec

Solution

  • Bryan Bende had the solution in a comment above: once I set nifi.remote.input.host to the IP address of the current machine streaming started working.