Search code examples
pythongoogle-cloud-platformapache-beamdataflow

Python alternative to Java DataFlow Streaming code


I have the following java code snippet which polls the gcs bucket for the arrival of new files. This is the code that I am using for my streaming pipeline, which would further load the data after applying some transformations into some destination.

PCollection<String> pcollection = pipeline.apply("Read From streaming source",
                TextIO.read().from("gs://abc/xyz")
                        .watchForNewFiles(Duration.standardSeconds(10), Watch.Growth.never()));

but for a particular use case I need to achieve the same thing using python, for which I am unable to find the required libraries, also all the implementations for the streaming pipelines in python are for PubSub. Running the python pipeline with streaming=true doesn't solve any purpose as the code exits after completion and doesn't wait for new files. Can someone suggest a way to proceed? Thanks in advance.


Solution

  • You need MatchContinuously (doc), which was added in this Pull Request.

    Edit:

    Given that I saw you had some issues with the PTransform (I think you deleted the comments), I am going to add a sample code:

    (p | MatchContinuously("gs://apache-beam-samples/shakespeare/*", interval=10.0)
       | Map(lambda x: x.path)
       | ReadAllFromText()
       | Map(lambda x: logging.info(x))
    )