Tags: google-cloud-dataflow, apache-beam, dataflow, bigtable, google-cloud-bigtable

How to filter the oldest cell in row with Cloud BigTable connector for DataFlow?


I'm trying to retrieve the oldest cell of a certain row in BigTable in my DataFlow pipeline (using Beam SDK 2.4.0). However, I can't seem to find any type of filter that would allow me to do this.

Further down the pipeline the value of the oldest cell would be used in conjunction with the newest cell and be written to BigQuery. This is what I have so far to retrieve the most recent cell:

input.apply("Read protos from BigTable", BigtableIO.read()
                .withProjectId(config.getBigtableProject())
                .withInstanceId(config.getBigtableInstance())
                .withTableId(this.bigTableId)
                .withRowFilter(RowFilter.newBuilder()
                        .setFamilyNameRegexFilter("proto")
                        .setCellsPerColumnLimitFilter(1)
                        .build()))
     .apply("Row to TableRow", ParDo.of(new DoFn<Row, TableRow>() { ...

I would expect there to be something similar that selects one cell, but in reverse order.

Any ideas?


Solution

  • Getting the oldest cell is possible, but there's no easy answer. In general, Bigtable only supports one form of ordering: cells within a column are ordered by version (timestamp) from largest to smallest, i.e. newest first.

    If you want to get a notion of "oldest", you can do one of the following:

    1. Read all of the cells, and get the oldest one.
    2. Reverse the ordering of the cells yourself: write each cell with an explicit timestamp of Long.MAX_VALUE - now, so the standard newest-first ordering returns the oldest write first.
    3. Read all of the cells, but apply the "strip value" filter (setStripValueTransformer(true)) so the first read returns timestamps without the cell data, and then follow up with another read for each row with a filter for the "oldest" timestamp you found in the first read.
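    The first two approaches can be sketched in plain Java. This is a minimal, self-contained illustration with hypothetical names (OldestCellSketch, oldestValue, reverse); real pipeline code would operate on com.google.bigtable.v2.Cell values from the Row proto rather than a plain map:

    ```java
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class OldestCellSketch {

        // Option 1: read all versions of a column (timestamp in
        // microseconds -> value) and pick the extremes client-side.
        static String oldestValue(NavigableMap<Long, String> cells) {
            return cells.firstEntry().getValue(); // smallest timestamp = oldest
        }

        static String newestValue(NavigableMap<Long, String> cells) {
            return cells.lastEntry().getValue();  // largest timestamp = newest
        }

        // Option 2: store a reversed timestamp at write time so Bigtable's
        // built-in newest-first ordering returns the oldest write first.
        // Applying the same mapping twice recovers the original timestamp.
        static long reverse(long timestampMicros) {
            return Long.MAX_VALUE - timestampMicros;
        }

        public static void main(String[] args) {
            NavigableMap<Long, String> cells = new TreeMap<>();
            cells.put(1_000L, "first");
            cells.put(2_000L, "second");
            cells.put(3_000L, "third");
            System.out.println(oldestValue(cells)); // first
            System.out.println(newestValue(cells)); // third

            long now = 3_000L;
            long reversed = reverse(now);
            System.out.println(reverse(reversed) == now); // true
        }
    }
    ```

    With option 2, a CellsPerColumnLimitFilter of 1 (as in the question's snippet) then yields the oldest cell directly, at the cost of having to reverse the timestamp whenever you need the real write time.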