Tags: spring-cloud-stream, spring-cloud-dataflow, spring-cloud-task

How to handle global resources in Spring Cloud Data Flow?


I'm learning the concepts of Spring Cloud Data Flow and wondering what the common way of storing global resources is.

For example, suppose I have a stream with a PMML processor and I would like to retrain the underlying PMML model periodically via a Spring Cloud Task.

Where would I store the model, so that it can be used as a (read-only) resource by the processor and updated by the task every night? Is there a concept of global storage in Spring Cloud Data Flow? Should I just use a traditional database outside of Spring Cloud, or is there a better way?


Solution

  • There is no general concept of shared storage within Spring Cloud Data Flow itself, but the Spring Resource abstraction used to provide the model to the PMML processor is quite flexible (see http://docs.spring.io/spring/docs/current/spring-framework-reference/html/resources.html, in particular Table 8.1, for the path options that can be used for the pmml.model-location parameter). So there are a couple of options out of the box (see the stream-definition sketch after this list):

    • use a shared filesystem (which could then be accessed via the file:// protocol);
    • store the model in a location that can be served as a static resource over HTTP (accessed via the http:// protocol).

    Additional options (which require adding extra jars to the application) are available for S3 (via Spring Cloud AWS - https://cloud.spring.io/spring-cloud-aws/) and HDFS (via Spring for Apache Hadoop - see http://docs.spring.io/spring-hadoop/docs/current/reference/htmlsingle/#using-hdfs-resource-loader).
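
    For illustration, here is a minimal sketch of how this could look with the out-of-the-box pmml processor, assuming the Data Flow shell and a model served over HTTP; the stream name, host and model file name below are hypothetical:

        dataflow:> stream create --name scoring --definition "http | pmml --pmml.model-location=http://models.internal/churn.pmml | log" --deploy

    With a shared filesystem instead, the same parameter could point at something like pmml.model-location=file:///shared/models/churn.pmml. The nightly task would then simply overwrite the model file at that location; depending on whether the processor caches the model in memory, it may need to be restarted (or the stream redeployed) to pick up the retrained version.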