There is a module named flink-jdbc, but it only supports a non-parallel, tuple-based JDBC InputFormat. To read from JDBC in parallel, it seems one has to customize it by implementing the interface org.apache.flink.core.io.InputSplit. So in my case, how can I implement a custom JdbcInputSplit to query data from the database in parallel?
Apache Flink does not provide a parallel JDBC InputFormat. So you need to implement one yourself. You can use the non-parallel JDBC InputFormat as a starting point.
In order to query a database in parallel, you need to split the query into several queries that cover non-overlapping (and ideally equally-sized) parts of the result set. Each of these smaller queries would be wrapped in an InputSplit and handed to a parallel instance of the input format.
Splitting the query is the challenging part because it depends on the query and on the data, so you need some meta information to come up with good splits. You might want to delegate this to the user of the input format and ask for a set of queries instead of a single one. You should also verify that the database handles many parallel queries better than a single large query.
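For the common case of a numeric key column, one way to produce such non-overlapping splits is to partition the key range into half-open intervals and append a range predicate to a base query. The sketch below is only an illustration of that idea under those assumptions; `buildRangeSplits`, the table, and the column names are hypothetical. Each generated query string would then be wrapped in your custom JdbcInputSplit and executed by one parallel instance of the input format:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitQueries {

    // Hypothetical helper: partitions [min, max] on a numeric key column into
    // numSplits non-overlapping range predicates appended to the base query.
    public static List<String> buildRangeSplits(
            String baseQuery, String keyColumn, long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long range = max - min;
        for (int i = 0; i < numSplits; i++) {
            long lo = min + (range * i) / numSplits;
            long hi = min + (range * (i + 1)) / numSplits;
            // Half-open intervals [lo, hi) guarantee non-overlapping coverage;
            // the last split is closed so the maximum key is not lost.
            String upperOp = (i == numSplits - 1) ? "<=" : "<";
            queries.add(baseQuery + " WHERE " + keyColumn + " >= " + lo
                    + " AND " + keyColumn + " " + upperOp + " " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Example: split one query over the key range [0, 100] into 4 splits.
        for (String q : buildRangeSplits("SELECT id, name FROM users", "id", 0, 100, 4)) {
            System.out.println(q);
        }
    }
}
```

Equally sized intervals only yield equally sized splits when the keys are roughly uniformly distributed; for skewed data you would need statistics (e.g., quantiles of the key column) to pick the boundaries.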