Search code examples
google-cloud-dataflowapache-beamcdc

Consolidating a single row filling latest value from PCollection of CDC records using Apache Beam dataflow


Change data capture (CDC) records will not have all values for the columns in a record. It is possible that for a record's primary key, say R1 we can have PCollection of CDC records for R1 with CDC Timestamp.
Ex. If Record R1 has columns C1, C2, C3, C4, CDCTimestamp

CDC records will be
R1,C1.1,--,C3.1,--,10:02
R1,C1.2,C2.2,--,C4.2,10:03
R1,--,C2.3,--,C4.3,10:04
R2,C2.1,--,C3.1,--,10:03

When Beam pipeline processes I need to get output as follows PCollection containing R1,C1.2,C2.3,C3.1,C4.3,10:04
R2,C2.1,--,C3.1,--,10:03

Any pointers would be appreciated! Thanks.


Solution

  • Not sure if I understand your question correctly.

    What about

    1) first transform it into KV pairs:

    R1: C1.1,--,C3.1,--,10:02
    R1: C1.2,C2.2,--,C4.2,10:03
    R1: --,C2.3,--,C4.3,10:04
    R2: C2.1,--,C3.1,--,10:03
    

    2) Then do a GroupBykey:

    R1-> C1.1,--,C3.1,--,10:02
         C1.2,C2.2,--,C4.2,10:03
         --,C2.3,--,C4.3,10:04
    R2-> C2.1,--,C3.1,--,10:03
    

    3) Then for each key, process the multiple elements and convert them into one.