Search code examples
apache-flink

Apache Flink: How can I create a parallel JDBC InputFormat?


There is a module named flink-jdbc which only supports non-parallel tuple type based JDBC InputFormat.

In order to use a parallel InputFormat for JDBC, it seems one needs to customize by implementing the interface: org.apache.flink.core.io.InputSplit.

So in my case, how can I custom implement JdbcInputSplit to query data in parallel from the database?


Solution

  • Apache Flink does not provide a parallel JDBC InputFormat. So you need to implement one yourself. You can use the non-parallel JDBC InputFormat as a starting point.

    In order to query a database in parallel, you need to split the query into several queries that cover non-overlapping (and ideally equally-sized) parts of the result set. Each of these smaller queries would be wrapped in an InputSplit and handed to a parallel instance of the input format.

    Splitting the query is the challenging part as it depends on the query and the data. So you need a bit of meta information to come up with good splits. You might want to delegate this to the user of the input format and ask for a set of queries instead of a single one. You should also check that the queried database handles parallel requests better than a single query.