apache-beam, apache-beam-io, apache-beam-internals

TextIO.read().from() vs TextIO.readFiles() with respect to withHintMatchesManyFiles()


In my use case, I receive a set of file patterns from Kafka:

PCollection<String> filepatterns = p.apply(KafkaIO.read()...);

Each pattern could match 300 or more files.

Q1. How can I use TextIO.read() to match files from a PCollection of patterns, given that withHintMatchesManyFiles() is available only on TextIO.read() and not on TextIO.readFiles()?

Q2. If I use the path FileIO.matchAll() -> FileIO.readMatches() -> TextIO.readFiles(), withHintMatchesManyFiles() isn't available. How will that affect read performance?

Q3. Is there any other optimized path for the above use case?


Solution

  • Correct, withHintMatchesManyFiles() is not available on TextIO.readFiles() out of the box. In fact, TextIO.read().withHintMatchesManyFiles() is itself implemented via FileIO transforms followed by TextIO.readFiles() (see details). With that structure, FileIO.readMatches() should distribute the file reading across the workers.

    So I think you can use the same approach when reading file patterns from a Kafka topic.
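
    The approach described above could look roughly like the following. This is only a sketch, not the code behind withHintMatchesManyFiles(): the class name PatternsToLines and the method readLines are hypothetical, and the choices of EmptyMatchTreatment.ALLOW and Compression.AUTO are assumptions that may not fit your data. The input is assumed to be the PCollection<String> of file patterns already extracted from the Kafka records.

```java
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.EmptyMatchTreatment;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper class, named for illustration only.
public class PatternsToLines {

  // Expands each file pattern into its matching files, then reads every
  // matched file line by line.
  public static PCollection<String> readLines(PCollection<String> filepatterns) {
    return filepatterns
        // One output element per file matched by each incoming pattern.
        // ALLOW is an assumption: a pattern that matches nothing is not an error.
        .apply("MatchAll",
            FileIO.matchAll().withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
        // Turns match metadata into readable file handles.
        // AUTO (an assumption) picks the compression from the file extension.
        .apply("ReadMatches",
            FileIO.readMatches().withCompression(Compression.AUTO))
        // Emits the lines of each file.
        .apply("ReadFiles", TextIO.readFiles());
  }
}
```

    With the filepatterns collection from the question, the call would be PatternsToLines.readLines(filepatterns). Because the Kafka source is unbounded, the resulting collection of lines is unbounded as well, so any downstream grouping would need windowing.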