google-cloud-dataflow, google-cloud-bigtable

Using the MultiRowRangeFilter in Google Bigtable


I've been trying to use the MultiRowRangeFilter in Google Bigtable, but I haven't managed to make it work properly. What I'm doing is scanning and processing several row ranges from Bigtable using Dataflow.

List<RowRange> ranges = getRanges();
MultiRowRangeFilter filter = new MultiRowRangeFilter(ranges);

Scan scan = new Scan();    
scan.setFilter(filter);

CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
                .withProjectId("my-project")
                .withInstanceId("my-instance")
                .withTableId("my-table")
                .withScan(scan)
                .build();

DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject("my-project");
options.setStagingLocation("gs://my-bucket");
options.setRunner(DataflowRunner.class);

Pipeline p = Pipeline.create(options);
p.apply(Read.from(CloudBigtableIO.read(config)))
                .apply(ParDo.of(new MyFunction()))
                .apply(TextIO.write().to("gs://output-bucket"));

getRanges is a function that returns a List<RowRange> whose elements are initialized like this:

RowRange range = new RowRange("1388710#1823246", true, "1388710#1823302", true);

Instead of scanning and returning only the ranges that I'm interested in, the scan returns all the data in my table.

Any idea what I'm doing wrong?


Solution

  • As discussed in the comments, MultiRowRangeFilter currently doesn't work with Cloud Dataflow. The feature request is tracked on GitHub here:

    https://github.com/googleapis/cloud-bigtable-client/issues/1239
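Until that is resolved, one possible stopgap is to check each row key against the ranges yourself, for example inside the ParDo. The sketch below is plain Java and assumes string row keys with both bounds inclusive, as in the question. The class InclusiveKeyRange and its methods are hypothetical names, not part of the HBase or Cloud Bigtable client. Note that this only discards unwanted rows after they have been read, so it does not reduce how much of the table is scanned.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for illustration only; not an HBase or Bigtable API.
final class InclusiveKeyRange {
    private final byte[] start;
    private final byte[] stop;

    InclusiveKeyRange(String start, String stop) {
        this.start = start.getBytes(StandardCharsets.UTF_8);
        this.stop = stop.getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned lexicographic byte comparison, the order in which row keys sort.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal common prefix: the shorter key sorts first.
        return a.length - b.length;
    }

    // True when start <= rowKey <= stop (both bounds inclusive).
    boolean contains(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        return compareKeys(start, key) <= 0 && compareKeys(key, stop) <= 0;
    }

    // True when at least one range in the list contains the row key.
    static boolean anyContains(List<InclusiveKeyRange> ranges, String rowKey) {
        for (InclusiveKeyRange range : ranges) {
            if (range.contains(rowKey)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside MyFunction, a call such as InclusiveKeyRange.anyContains(ranges, rowKey) could decide whether a row is emitted or dropped, assuming the row key has first been decoded to a String.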