
Consuming messages from Google Pub/Sub and publishing them to Kafka

I am trying to consume Google Pub/Sub messages using the synchronous pull API, which is available in the Apache Beam Google Pub/Sub IO connector library. I want to write the consumed messages to Kafka using KafkaIO, and I want to use the FlinkRunner to execute the job because we run this application outside GCP.

The problem I am facing is that the consumed messages are not getting ACK'd in GCP Pub/Sub. I have confirmed that the local Kafka instance receives the messages consumed from GCP Pub/Sub. The GCP Dataflow documentation indicates that a data bundle is finalized when the pipeline terminates in a data sink, which is Kafka in my case.

But since the code is running in Apache Flink and not GCP Dataflow, I suspect that some callback related to ACK'ing the committed messages is not being fired.
What am I doing wrong here?

                    .apply("Read  GCP PubSub Messages", PubsubIO.readStrings()
                    .apply("Filter-1", ParDo.of(new FilterTextFn()))
                    .apply(AsJsons.of(User.class).withMapper(new ObjectMapper()))
                    .apply("Write to Local Kafka",


  • I fixed this by following Guillaume Blaquiere's suggestion to look at checkpoints. Even after adding the Window.into() function to the pipeline, the source Pub/Sub subscription did not receive ACKs.
    The problem was in the Flink server configuration: I had not specified any checkpoint configuration. Without these parameters, checkpoints are disabled.

    state.backend: rocksdb
    state.checkpoints.dir: file:///tmp/flink-1.9.3/state/checkpoints/

    These settings go in flink_home/conf/flink-conf.yaml. After I added these entries and restarted Flink, the backlog of unACK'd messages dropped to 0 in the GCP Pub/Sub monitoring chart.
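
To illustrate why disabled checkpoints leave messages un-ACK'd, here is a minimal sketch of the idea, not Beam's or Flink's actual code. As far as I understand, an unbounded Beam source such as PubsubIO defers its ACKs until the runner finalizes a checkpoint, so if the runner never checkpoints, the ACK step never runs even though the messages have already reached the sink. The classes FakeSubscription and FakeReader and the method run are hypothetical names made up for this toy model.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (NOT Beam's real classes): a source that only ACKs messages
 * when the runner finalizes a checkpoint.
 */
public class AckOnCheckpointDemo {

    /** Hypothetical stand-in for a subscription that tracks un-ACK'd messages. */
    static class FakeSubscription {
        private final List<String> outstanding = new ArrayList<>();

        void deliver(String ackId) { outstanding.add(ackId); }
        void acknowledge(List<String> ackIds) { outstanding.removeAll(ackIds); }
        int backlog() { return outstanding.size(); }
    }

    /** Hypothetical stand-in for a reader that collects ack IDs until a checkpoint is finalized. */
    static class FakeReader {
        private final FakeSubscription subscription;
        private final List<String> pendingAckIds = new ArrayList<>();

        FakeReader(FakeSubscription subscription) { this.subscription = subscription; }

        /** Pull one message and hand it downstream; the ACK is deferred. */
        void read(String ackId) {
            subscription.deliver(ackId);
            pendingAckIds.add(ackId);
        }

        /** Invoked by the runner only when a checkpoint completes. */
        void finalizeCheckpoint() {
            subscription.acknowledge(new ArrayList<>(pendingAckIds));
            pendingAckIds.clear();
        }
    }

    /** Simulates one run and returns the backlog left on the subscription. */
    static int run(int messages, boolean checkpointingEnabled) {
        FakeSubscription subscription = new FakeSubscription();
        FakeReader reader = new FakeReader(subscription);
        for (int i = 0; i < messages; i++) {
            reader.read("ack-" + i);
        }
        // With checkpointing disabled, nothing ever triggers the ACK step.
        if (checkpointingEnabled) {
            reader.finalizeCheckpoint();
        }
        return subscription.backlog();
    }

    public static void main(String[] args) {
        System.out.println("backlog without checkpoints: " + run(5, false)); // prints 5
        System.out.println("backlog with checkpoints:    " + run(5, true));  // prints 0
    }
}
```

In this model the messages are read in both runs, which matches what I saw: Kafka had the data, but the Pub/Sub backlog only drained once checkpoints were actually happening.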