We are planning to create a new processing mechanism that consists of listening to a few directories, e.g. /opt/dir1, /opt/dirN, and, for each document created in these directories, starting a routine to process it, persist its records in a database (via REST calls to an existing CRUD API) and generate a protocol file in another directory.
For testing purposes, I am not using any modern (or even decent) framework/approach, just a regular Spring Boot app with a WatchService implementation that listens to these directories and polls the files for processing as soon as they are created. It works, but I will clearly run into performance problems once I move to production and start receiving dozens of files to be processed in parallel, which isn't the case in my example.
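For reference, a minimal sketch of the kind of WatchService loop described above (the directory names and the processFile call are placeholders):

```java
import java.nio.file.*;

public class DirectoryWatcher {

    public static void main(String[] args) throws Exception {
        WatchService watchService = FileSystems.getDefault().newWatchService();

        // Register every monitored directory for "file created" events
        for (String dir : new String[] {"/opt/dir1", "/opt/dirN"}) {
            Paths.get(dir).register(watchService, StandardWatchEventKinds.ENTRY_CREATE);
        }

        while (true) {
            WatchKey key = watchService.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path createdFile = ((Path) key.watchable()).resolve((Path) event.context());
                processFile(createdFile); // placeholder: parse, call the CRUD API, write the protocol file
            }
            key.reset();
        }
    }

    private static void processFile(Path file) {
        // processing logic would go here
    }
}
```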
After some research and some tips from a few colleagues, I found Spring Batch + Spring Cloud Data Flow to be the best combination for my needs. However, I have never dealt with either Batch or Data Flow before, and I'm kind of confused about what building blocks I need and how to put them together to get this routine going in the simplest and most performant manner. I have a few questions regarding its added value and architecture and would really appreciate hearing your thoughts!
I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
How can I keep some variables externalized for my task, e.g. the directory paths? Can I have these streams and tasks read an application.yml somewhere other than inside their jar?
Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Is it just because it spawns parallel tasks for each file, or do I get any other benefit?
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation if you consider only the sequential processing scenario, where I'd receive only one file per hour or so?
Also, if any of you have any guide or sample on how to launch a task programmatically, I would really appreciate it! I am still searching for that, but it doesn't seem like I'm doing it right.
Thank you for your attention and any input is highly appreciated!
UPDATE
I managed to launch my task via the SCDF REST API, so I could keep my original Spring Boot app with WatchService and have it launch a new task via Feign or XXX. I still know this is far from what I should do here. After some more research, I think creating a stream using a file source and sink would be the way to go, unless someone has another opinion, but I can't get the inbound channel adapter to poll from multiple directories, and I can't have multiple streams, because this platform is supposed to scale to the point where we have thousands of participants (or directories to poll files from).
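For illustration, a launch via the SCDF REST API from the WatchService app could look roughly like this (shown with a plain RestTemplate instead of Feign; the server address, task name and argument name are placeholders, and /tasks/executions is the endpoint the SCDF REST API exposes for launching registered tasks):

```java
import org.springframework.web.client.RestTemplate;
import org.springframework.web.util.UriComponentsBuilder;

public class TaskLaunchClient {

    private final RestTemplate restTemplate = new RestTemplate();
    private final String dataFlowServerUri = "http://localhost:9393"; // placeholder SCDF server address

    /** Launches the registered "fileIngestTask" task, passing the new file's path as an argument. */
    public void launchIngestTask(String filePath) {
        String uri = UriComponentsBuilder.fromHttpUrl(dataFlowServerUri)
                .path("/tasks/executions")
                .queryParam("name", "fileIngestTask")               // task name as registered in SCDF
                .queryParam("arguments", "--filePath=" + filePath)  // handed to the batch job as an argument
                .toUriString();
        // The response body carries the id of the new task execution
        restTemplate.postForObject(uri, null, String.class);
    }
}
```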
Here are a few pointers.
I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If you have to launch it automatically upon an upstream event (e.g. a new file), yes, you could do that via a stream (see example). If the events are coming off of a message broker, you can also consume them directly in the batch job (e.g. with AmqpItemReader).
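For instance, a reader wired against RabbitMQ could look roughly like this (a sketch only; the bean name and item type are illustrative):

```java
import org.springframework.amqp.core.AmqpTemplate;
import org.springframework.batch.item.amqp.AmqpItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ReaderConfig {

    // Reads each broker message (e.g. a "new file" event) as a batch item.
    // The AmqpTemplate is assumed to have a default receive queue configured.
    @Bean
    public AmqpItemReader<String> fileEventReader(AmqpTemplate amqpTemplate) {
        return new AmqpItemReader<>(amqpTemplate);
    }
}
```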
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
Hopefully the above example clarifies it. If you want to launch the Task programmatically (not via the shell DSL/REST/UI), you can do so with the new Java DSL support, which was added in 1.3.
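The Java DSL builds on the same DataFlowOperations/DataFlowTemplate client; a minimal sketch of launching a pre-registered task through it could look like this (server URI, task name and argument are placeholders, and exact method signatures may differ slightly between SCDF versions):

```java
import java.net.URI;
import java.util.Collections;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;

public class ProgrammaticTaskLaunch {

    public static void main(String[] args) {
        // Client for the Data Flow server's REST API (placeholder address)
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // Launch the previously registered "fileIngestTask", handing it the new file's path
        dataFlow.taskOperations().launch(
                "fileIngestTask",
                Collections.emptyMap(),                                           // deployment properties
                Collections.singletonList("--filePath=/opt/dir1/new-file.txt"));  // task arguments
    }
}
```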
How can I keep some variables externalized for my task, e.g. the directory paths? Can I have these streams and tasks read an application.yml somewhere other than inside their jar?
The recommended approach is to use Config Server. Depending on the platform where this is being orchestrated, you'd have to provide the Config Server credentials to the Task and its sub-tasks, including the batch jobs. In Cloud Foundry, we simply bind a Config Server service instance to each of the tasks, and at runtime the externalized properties are resolved automatically.
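As an illustration, the task/batch application could point to a Config Server through its bootstrap configuration along these lines (the URI and application name are placeholders):

```yaml
# bootstrap.yml of the task/batch application (illustrative values)
spring:
  application:
    name: file-ingest-task        # key under which the Config Server resolves properties
  cloud:
    config:
      uri: http://localhost:8888  # placeholder Config Server address
```

The externalized application.yml served by the Config Server could then carry properties such as the directory paths, instead of packaging them inside the jar.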
Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Is it just because it spawns parallel tasks for each file, or do I get any other benefit?
As a replacement for Spring Batch Admin, SCDF provides monitoring and management for Tasks and batch jobs. The executions, steps, step progress, and stack traces upon errors are persisted and available to explore from the Dashboard. You can also use SCDF's REST endpoints directly to examine this information.
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation if you consider only the sequential processing scenario, where I'd receive only one file per hour or so?
This is implementation specific. We do not have any benchmarks to share. However, if performance is a requirement, you could explore the remote-partitioning support in Spring Batch. You can partition the ingest or data-processing Tasks across "n" workers, and that way achieve parallelism.
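For reference, a partitioned master step in Spring Batch looks roughly like this (names and the file pattern are illustrative; for remote partitioning on SCDF, the injected PartitionHandler would typically be Spring Cloud Task's DeployerPartitionHandler, which launches each worker partition as a separate task):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

@Configuration
public class PartitionedIngestConfig {

    // Splits the work: one partition per file found under the ingest directory (illustrative pattern)
    @Bean
    public MultiResourcePartitioner partitioner() throws Exception {
        Resource[] files = new PathMatchingResourcePatternResolver()
                .getResources("file:/opt/dir1/*.txt");
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(files);
        return partitioner;
    }

    // Master step: delegates each partition to the handler, which runs the worker step in parallel
    @Bean
    public Step masterStep(StepBuilderFactory steps, PartitionHandler partitionHandler) throws Exception {
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner())
                .partitionHandler(partitionHandler)
                .build();
    }
}
```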