In my use case, I get a set of matching file patterns from Kafka:
PCollection<String> filepatterns = p.apply(KafkaIO.read()...);
Each pattern can match 300 or more files.
Q1. How can I use TextIO.Read() to match data from a PCollection, given that withHintMatchesManyFiles() is available only for TextIO.Read() and not for TextIO.ReadFiles()?
Q2. If the path FileIO.Match -> FileIO.ReadMatch() -> TextIO.ReadFiles() is used, withHintMatchesManyFiles() isn't available. How will that impact read performance?
Q3. Is there any other optimized path for the above use case?
Yes, you can't have withHintMatchesManyFiles() with TextIO.ReadFiles() out of the box. Actually, TextIO.Read().withHintMatchesManyFiles() is itself implemented via FileIO transforms plus TextIO.ReadFiles() (see details). In that implementation, FileIO.readMatches() should distribute the file reading over the workers.
So I think you can use the same approach when reading file names from a Kafka topic.
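
A minimal sketch of that approach, assuming the Kafka values have already been extracted into a PCollection<String> of file patterns. The class name ReadFromPatterns, the method name readLines, and the step labels are made up for illustration. Because the patterns arrive as a PCollection, the sketch uses FileIO.matchAll() rather than FileIO.match(), which takes a single fixed pattern.

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper class; not part of Beam.
final class ReadFromPatterns {

  // Takes file patterns (for example, values read from a Kafka topic)
  // and returns the lines of every file matching any of those patterns.
  static PCollection<String> readLines(PCollection<String> filepatterns) {
    return filepatterns
        // Expand each pattern into metadata for the files it matches.
        .apply("Match All", FileIO.matchAll())
        // Turn the match results into readable file handles.
        .apply("Read Matches", FileIO.readMatches())
        // Read each file line by line.
        .apply("Via ReadFiles", TextIO.readFiles());
  }
}
```

With the filepatterns collection from the question, this would be called as ReadFromPatterns.readLines(filepatterns). For an unbounded source such as Kafka, windowing and empty-match handling may need extra configuration that this sketch leaves out.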