apache-flink

Moving data between paradigms in Flink


I have some batch-processed data in a relational database that I want to push to a message bus using Flink. Since Flink supports both the batch and streaming paradigms, it appears to be a good fit. That said, I cannot tell whether this task belongs in a StreamingJob or a BatchJob, or how to connect the two. Is this task better suited to a FlinkSQL environment?

Is this possible? What do I need to be aware of?


Solution

It really depends on what you actually want to do, the size of the data, and so on. If you just want to read data from a database and write it to, say, Kafka, you may want to take a look at Flink's Table/SQL API, since it already has a JDBC source implemented.
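
With the Table API the whole job can be expressed as a couple of SQL statements. A minimal sketch, assuming the JDBC and Kafka connector jars (plus a JDBC driver and flink-json) are on the classpath; the schema, connection URL, table name, topic, and brokers below are placeholders:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class JdbcToKafka {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inStreamingMode().build());

            // Bounded source backed by the relational database (placeholder schema/URL).
            tEnv.executeSql(
                    "CREATE TABLE batch_source (" +
                    "  id BIGINT," +
                    "  payload STRING" +
                    ") WITH (" +
                    "  'connector' = 'jdbc'," +
                    "  'url' = 'jdbc:postgresql://localhost:5432/mydb'," +
                    "  'table-name' = 'processed_records'" +
                    ")");

            // Kafka sink acting as the message bus (placeholder topic/brokers).
            tEnv.executeSql(
                    "CREATE TABLE bus_sink (" +
                    "  id BIGINT," +
                    "  payload STRING" +
                    ") WITH (" +
                    "  'connector' = 'kafka'," +
                    "  'topic' = 'processed-records'," +
                    "  'properties.bootstrap.servers' = 'localhost:9092'," +
                    "  'format' = 'json'" +
                    ")");

            // The job finishes once the bounded JDBC table has been fully read.
            tEnv.executeSql("INSERT INTO bus_sink SELECT id, payload FROM batch_source");
        }
    }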

You could technically use the DataStream API, but it doesn't have a JDBC source AFAIK, so it would be harder, since you would need to implement one yourself.
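
If you did go that route, a hand-rolled source might look roughly like the sketch below, using the classic SourceFunction interface; the connection URL, credentials, table, and query are placeholders:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // A minimal hand-rolled JDBC source: opens a connection, emits each row
    // as a String, then finishes, producing a bounded stream.
    public class SimpleJdbcSource extends RichSourceFunction<String> {
        private volatile boolean running = true;
        private transient Connection connection;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Placeholder URL and credentials.
            connection = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            try (Statement stmt = connection.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT payload FROM processed_records")) {
                while (running && rs.next()) {
                    ctx.collect(rs.getString("payload"));
                }
            }
        }

        @Override
        public void cancel() {
            running = false;
        }

        @Override
        public void close() throws Exception {
            if (connection != null) {
                connection.close();
            }
        }
    }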

Generally, Flink has two main APIs for handling data, DataSet and DataStream, and you can't convert between them. DataSet is more or less what Dataset is in Spark: Flink knows it is dealing with bounded, partitioned data, so you can, for example, sort it or apply other transformations that exploit the fact that the data has a defined size.
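
For instance, sorting is available on DataSet precisely because the input is known to be bounded. A small sketch (sortPartition with parallelism 1 yields a globally ordered result here):

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class DataSetSortExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Bounded data: Flink knows the size up front, so a full sort is possible.
            DataSet<Tuple2<Integer, String>> records = env.fromElements(
                    Tuple2.of(3, "c"), Tuple2.of(1, "a"), Tuple2.of(2, "b"));

            records
                    .sortPartition(0, Order.ASCENDING)
                    .setParallelism(1) // one partition -> global order
                    .print();
        }
    }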

DataStream is a streaming API which, in general, lets you handle technically infinite streams of data. You can also use DataStream to create a finite stream, for example by reading from a file. But in general, when handling data with DataStream you will not be able to directly sort it or do some of the things you could do with a DataSet, even if the data held in the stream is finite.
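
For example, reading a file with the DataStream API yields a stream that simply ends once the file is exhausted, but the operations available on it are still the streaming ones; there is no global sort. The path below is a placeholder:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FiniteStreamExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // readTextFile produces a *finite* stream: the job ends when the file is consumed.
            DataStream<String> lines = env.readTextFile("/path/to/input.csv");

            // Streaming-style transformations work; sorting the whole stream does not.
            lines.map(String::toUpperCase).print();

            env.execute("finite-stream-from-file");
        }
    }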

So, as for moving data between paradigms: you can write an application that processes a DataStream, and it will work both with an infinite stream of events from Kafka and with 10 records from a CSV file, and Flink will be able to apply some optimizations based on whether the data is finite or not (introduced in 1.12, IIRC). But you won't be able to read a file into a DataSet, sort its partitions, and then map it to a DataStream without doing some additional work, such as storing the data in a file and reading it back as a DataStream.
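
The 1.12 behaviour mentioned above is the DataStream batch execution mode: you write one DataStream program, and if its sources are bounded you can ask the runtime to execute it with batch-style scheduling. A minimal sketch:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BatchModeDataStream {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Available since Flink 1.12: run this DataStream job with batch-style
            // execution (staged scheduling, blocking shuffles), which is valid
            // because the source below is bounded.
            env.setRuntimeMode(RuntimeExecutionMode.BATCH);

            env.fromElements(1, 2, 3, 4, 5).print();

            env.execute("bounded-datastream-in-batch-mode");
        }
    }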