I want to receive data from multiple URLs; you can think of each URL as representing one device. I could create a flow that starts with a GetHTTP processor per device, but that scenario is bad for me. Another option is to create a flow that starts with GenerateFlowFile (with every URL defined in that processor), then splits the list and sends the URLs to an InvokeHTTP processor. But then each URL is fetched sequentially, so I may lose data from the other devices while a request to one URL is in flight.
What can I do in this case?
Edit: For my use case, I must first receive data from multiple URLs, then apply some transformations and send the data to Kafka. I have to fetch from almost 50 or more URLs. I need to do this in real time and in a scalable way on a NiFi cluster.
Use the same flow as mentioned in the question:
Described flow in the question:
1.GenerateFlowFile
2.SplitText
3.ExtractText
Then feed the success relationship of the ExtractText processor to a Remote Process Group
(to distribute the load across the cluster).
Then feed the distributed flowfiles to an InvokeHTTP
processor, and on the Scheduling tab configure the processor to run more than one concurrent task.
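The fetch side of the flow above can be sketched with the following processor properties. This is a minimal sketch, not a definitive configuration: the URLs, the run schedule, the attribute name `url`, and the concurrent-task count are all assumptions you should adapt to your own devices.

```
# GenerateFlowFile
Custom Text:    http://device-1.example.com/data    <- one URL per line (example hosts)
                http://device-2.example.com/data
Run Schedule:   10 sec                              <- polling interval, adjust as needed

# SplitText
Line Split Count: 1                                 <- one flowfile per URL

# ExtractText
url: (?s)(^.*$)                                     <- dynamic property; captures the line
                                                       into the 'url' attribute

# InvokeHTTP
HTTP Method:      GET
Remote URL:       ${url}                            <- Expression Language reads the attribute
Concurrent Tasks: 4                                 <- Scheduling tab; fetch URLs in parallel
```

With one flowfile per URL and multiple concurrent tasks (plus cluster-wide distribution), the 50-odd URLs are fetched in parallel instead of sequentially.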
Then use a PublishKafkaRecord
processor, define the Record Reader/Writer and their schemas, and likewise change the schedule to run more than one concurrent task.
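The publish side could look like the fragment below. The broker addresses, topic name, and the choice of JSON reader/writer are assumptions for illustration; pick the PublishKafkaRecord version that matches your Kafka cluster.

```
# PublishKafkaRecord_2_0 (or the version matching your Kafka brokers)
Kafka Brokers:    broker1:9092,broker2:9092   <- hypothetical broker addresses
Topic Name:       device-data                 <- hypothetical topic
Record Reader:    JsonTreeReader              <- assumes devices return JSON
Record Writer:    JsonRecordSetWriter
Concurrent Tasks: 4                           <- Scheduling tab
```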
Final flow:
1.GenerateFlowFile
2.SplitText
3.ExtractText
4.Remote Process Group (or) load-balanced connection (starting with NiFi 1.8.0)
5.InvokeHTTP //more than one concurrent task
6.Remote Process Group (or) load-balanced connection (starting with NiFi 1.8.0) //optional
7.PublishKafkaRecord //more than one concurrent task
Try the above flow; I believe the Kafka processors are very scalable and will give you the performance you expect :)
In addition:
Starting with NiFi 1.8 we no longer
need to use a Remote Process Group
to distribute the load, as we can configure connections themselves to load-balance flowfiles across the cluster.
Refer to this and NIFI-5516 for more details regarding these new additions
in NiFi 1.8.
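For reference, a load-balanced connection would replace step 4 (and optionally step 6) above with settings like the following in the connection's configuration dialog; "Round robin" is one of the available strategies and a reasonable default here:

```
# Connection settings (NiFi 1.8.0+)
Load Balance Strategy:    Round robin       <- spread flowfiles evenly across nodes
Load Balance Compression: Do not compress
```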