google-cloud-storage glob google-cloud-dataflow apache-beam

TextIO. Read multiple files from GCS using pattern {}


I tried using the following:

TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")

That pattern didn't work. I get:

java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}

Both files do exist, though. I also tried a similar expression with a local file:

TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")

And that did work just fine.

I would've thought all kinds of globs would be supported for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?


Solution

  • This may be another option, in addition to Scott's suggestion and your comment on his answer:

    You can define a list of the paths you want to read and then iterate over it, creating one PCollection per path in the usual way:

    PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
    PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
    

    Then create a PCollectionList:

    PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
    

    And then flatten this list into your PCollection for your main input:

    PCollection<String> events = eventsList.apply(Flatten.pCollections());
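
If you would rather keep the {a,b} pattern in one place than type out each path, the list of paths could be built from it first. The following is only a rough sketch using plain Java: the class name BracePaths and the method expand are hypothetical helpers, not part of Beam or the Dataflow SDK, and the two dates are illustrative. It assumes a single, non-nested brace group.

```java
import java.util.ArrayList;
import java.util.List;

public class BracePaths {
    /**
     * Hypothetical helper: expands the first {a,b,c} group in a pattern
     * into one explicit path per alternative. A pattern without a brace
     * group is returned unchanged as a single-element list. Nested or
     * multiple groups are not handled in this sketch.
     */
    public static List<String> expand(String pattern) {
        List<String> paths = new ArrayList<>();
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open + 1);
        if (open < 0 || close < 0) {
            paths.add(pattern);
            return paths;
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        // Keep empty alternatives such as "{a,}" by passing -1 to split.
        for (String alternative : pattern.substring(open + 1, close).split(",", -1)) {
            paths.add(prefix + alternative + suffix);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Illustrative pattern; the bucket name is taken from the question.
        for (String path : expand("gs://xyz.abc/xxx_{2017-06-05,2017-06-06}.csv")) {
            System.out.println(path);
        }
    }
}
```

Each returned path can then be passed to TextIO.Read.from(path), with the resulting PCollections collected into a PCollectionList and flattened as shown above.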