Continuously process Parquet files as DataStreams in Flink's DataStream API

I have a Parquet file on HDFS. It is overwritten daily with a new one. My goal is to emit this file as a DataStream in a Flink job, using the DataStream API, every time it changes. The end goal is to use the file content in a Broadcast State, but that is out of scope for this question.

  1. To process a file continuously, there is a very useful API, described in the Data Sources section of the Flink documentation: FileProcessingMode.PROCESS_CONTINUOUSLY. This is exactly what I need. It works for reading and monitoring text files, but not for Parquet files:
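// Partial version 1: the raw file is processed continuously
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._
val path: String = "hdfs://hostname/path_to_file_dir/"
val textInputFormat: TextInputFormat = new TextInputFormat(new Path(path))
// re-scan the path every minute (the interval is in milliseconds)
val stream: DataStream[String] = streamExecutionEnvironment.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60000)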
  2. To process Parquet files, I can use Hadoop InputFormats through the API described in the "Using Hadoop InputFormats" documentation. However, this API has no FileProcessingMode parameter, so the file is processed only once:
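// Partial version 2: the parquet file is only processed once
import org.apache.flink.api.scala.hadoop.mapred.HadoopInputFormat
import org.apache.flink.hadoopcompatibility.scala.HadoopInputs
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
import org.apache.hadoop.io.ArrayWritable
val parquetPath: String = "/path_to_file_dir/parquet_0000"
// Hadoop input format that reads each Parquet record as an ArrayWritable
val hadoopInputFormat: HadoopInputFormat[Void, ArrayWritable] = HadoopInputs.readHadoopFile(new MapredParquetInputFormat(), classOf[Void], classOf[ArrayWritable], parquetPath)
val stream: DataStream[(Void, ArrayWritable)] = streamExecutionEnvironment.createInput(hadoopInputFormat).map { record =>
  // process the record here ...
  record
}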

I would like to combine the two APIs somehow, so that Parquet files are processed continuously via the DataStream API. Has anyone tried something like this?


  • After browsing Flink's code, it looks like those two APIs are quite different, and it does not seem possible to merge them.

    The other approach, which I will detail here, is to define your own SourceFunction that will periodically read the file:

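    import org.apache.flink.streaming.api.functions.source.SourceFunction
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.column.page.PageReadStore
    import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.io.ColumnIOFactory
    class ParquetSourceFunction extends SourceFunction[Int] {
      // volatile because cancel() is called from a different thread than run()
      @volatile private var isRunning = true
      override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
        while (isRunning) {
          val path = new Path("path_to_parquet_file")
          val conf = new Configuration()
          val readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)
          val metadata = readFooter.getFileMetaData
          val schema = metadata.getSchema
          val parquetFileReader = new ParquetFileReader(conf, metadata, path, readFooter.getBlocks, schema.getColumns)
          var pages: PageReadStore = null
          try {
            // read the file one row group at a time
            while ({ pages = parquetFileReader.readNextRowGroup; pages != null }) {
              val rows = pages.getRowCount
              val columnIO = new ColumnIOFactory().getColumnIO(schema)
              val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
              (0L until rows).foreach { _ =>
                val group = recordReader.read()
                val my_integer = group.getInteger("field_name", 0)
                ctx.collect(my_integer) // emit the value into the stream
              }
            }
          } finally {
            parquetFileReader.close()
          }
          // do whatever logic suits you to stop "watching" the file
          // assumption: wait one minute between reads, like the 60000 ms interval in the question
          Thread.sleep(60000)
        }
      }
      override def cancel(): Unit = isRunning = false
    }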

    Then, use the streamExecutionEnvironment to register this source:

    val dataStream: DataStream[Int] = streamExecutionEnvironment.addSource(new ParquetSourceFunction)
    // do what you want with your new datastream
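
The source function above re-reads the file on every pass of its loop, whether or not the file was overwritten in the meantime. Since the question asks to emit the content only when the file changes, one option is to compare modification times before reading. Below is a minimal sketch of that idea. `FileChangeDetector` is a hypothetical helper, not part of Flink or Parquet, and it assumes a local file accessed through `java.nio` so that it runs without a cluster.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper (not a Flink or Parquet class): remembers the last
// modification time it observed and reports whether the file changed since.
final class FileChangeDetector(path: Path) {
  // -1 means "nothing observed yet", so the first check of an existing file
  // is always reported as a change.
  private var lastSeenMillis: Long = -1L

  /** True if the file exists and its modification time differs from the last one seen. */
  def hasChanged(): Boolean =
    if (!Files.exists(path)) {
      // The file can be briefly absent while it is being overwritten.
      false
    } else {
      val modified = Files.getLastModifiedTime(path).toMillis
      val changed = modified != lastSeenMillis
      lastSeenMillis = modified
      changed
    }
}
```

Inside `run`, the Parquet reading code would then be wrapped in `if (detector.hasChanged()) { ... }`, so an unchanged file is skipped until the next poll. For a file on HDFS, the same comparison could presumably be made on the value returned by Hadoop's `FileSystem.getFileStatus(path).getModificationTime` instead of `java.nio`. The check says nothing about whether the writer has finished, so a job that overwrites the file in place may still need its own "write complete" signal.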