I was looking for architectures to do streaming sentiment analysis with Spark and I came across this one.
I was wondering: what are the advantages of using NiFi + Kafka with the Twitter API instead of connecting Spark to it directly? I assume this setup is more fault tolerant, but I really don't know.
NiFi is a data integration tool - it moves data. It's great for taking data from a source (e.g. Twitter) and writing it to a destination (e.g. Kafka).
In general, NiFi excels at continuously pulling from a source and pushing to a destination (but you can also push to NiFi, and pull from NiFi, by creating endpoints in your flows).
In your case, you are pulling from Twitter - how are you going to pull that data, and how is it then going to be delivered to Spark? Generally speaking, Spark wants to pull from a source.
NiFi has a lot of built-in features for integrating with data sources, including pulling from Twitter. By using NiFi, you do not have to write that functionality yourself.
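For example (processor names vary a bit with your NiFi and Kafka versions, so treat these as illustrative), a NiFi 1.x flow can be as simple as a GetTwitter processor, configured with your API credentials and filter terms, wired directly into a PublishKafka processor pointing at your topic - no code required on that side.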
NiFi does not have a well-known protocol for pushing data to it or pulling data from it, because that is not NiFi's purpose. You can build that functionality yourself inside NiFi, for example by creating HTTP endpoints in your flows or by using NiFi's Site-to-Site protocol, but then you are going down less well-trodden paths and adding a lot of work for yourself.
Kafka, however, has a well-known protocol, and Spark has very good integration with Kafka as a streaming source. You can connect the two very easily with little custom work.
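As a rough sketch of the Spark side (the broker address, topic name, and console sink below are placeholders for your own setup, and you will need the Kafka integration package, e.g. org.apache.spark:spark-sql-kafka-0-10, on the classpath), consuming the tweets that NiFi published to Kafka with Structured Streaming can look like this:

```scala
import org.apache.spark.sql.SparkSession

object TweetsFromKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TwitterSentimentStreaming")
      .getOrCreate()

    import spark.implicits._

    // Subscribe to the Kafka topic that NiFi writes tweets to.
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "tweets")                        // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS tweet_json")
      .as[String]

    // Placeholder sink: print incoming tweets to the console.
    // Your sentiment analysis step would replace this.
    val query = tweets.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

From there, your sentiment analysis is just another transformation on that stream; the Twitter-specific plumbing stays entirely in NiFi.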
NiFi also integrates very well with Kafka as a destination for data.
Thus, NiFi can handle Twitter -> Kafka out of the box, while Spark can handle consuming from Kafka out of the box. You do not have to write much, if any, custom code to get your Twitter data.
Of course, Kafka also adds all of its well-understood benefits for this use case, many of which aren't present in NiFi (because NiFi is not a message broker and is not trying to provide the same features).