apache-flink flink-streaming flink-batch

Flink read data from Hadoop and publish to Kafka

I have a requirement to read data from HDFS and publish it to a Kafka topic. Since reading from HDFS and writing to Kafka appear to belong to the DataSet and DataStream APIs respectively, is it possible to do both in a single job?

Solution

Flink's DataStream API can read files from HDFS, so a single streaming job can both read the data and write it to Kafka. See readFile() (or readTextFile() for plain text files) in https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html#data-sources. Alternatively, you can use the file system connector with the Table and SQL APIs, but it only supports CSV.
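
As a rough illustration of the DataStream approach, here is a minimal sketch that reads text lines from HDFS and writes each line to Kafka. It assumes the flink-connector-kafka dependency with the FlinkKafkaProducer class (newer Flink releases replace it with KafkaSink) and Hadoop file system support on the classpath. The HDFS path, broker address, topic name, class name, and the producerProperties helper are all made up for the example, so adjust them to your setup.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class HdfsToKafkaJob {

    // Hypothetical helper: builds the minimal Kafka producer configuration.
    static Properties producerProperties(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        return props;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed HDFS location; readTextFile emits one record per line.
        DataStream<String> lines =
                env.readTextFile("hdfs://namenode:8020/data/input");

        // Assumed topic and broker; each line is written as a UTF-8 string.
        FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
                "output-topic",
                new SimpleStringSchema(),
                producerProperties("localhost:9092"));

        lines.addSink(producer);

        env.execute("HDFS to Kafka");
    }
}
```

With readTextFile the input is read once and the job finishes when all files have been processed. If the directory should be monitored for new files, readFile() with FileProcessingMode.PROCESS_CONTINUOUSLY is the variant to look at.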