Change data capture (CDC) records will not have all values for the columns in a record. It is possible that for a record's primary key, say R1 we can have PCollection of CDC records for R1 with CDC Timestamp.
Ex. If Record R1 has columns C1, C2, C3, C4, CDCTimestamp
CDC records will be
R1,C1.1,--,C3.1,--,10:02
R1,C1.2,C2.2,--,C4.2,10:03
R1,--,C2.3,--,C4.3,10:04
R2,C2.1,--,C3.1,--,10:03
When Beam pipeline processes I need to get output as follows
PCollection containing
R1,C1.2,C2.3,C3.1,C4.3,10:04
R2,C2.1,--,C3.1,--,10:03
Any pointers would be appreciated! Thanks.
Not sure if I understand your question correctly.
What about
1) first transform it into KV pairs:
R1: C1.1,--,C3.1,--,10:02
R1: C1.2,C2.2,--,C4.2,10:03
R1: --,C2.3,--,C4.3,10:04
R2: C2.1,--,C3.1,--,10:03
2) Then do a GroupBykey:
R1-> C1.1,--,C3.1,--,10:02
C1.2,C2.2,--,C4.2,10:03
--,C2.3,--,C4.3,10:04
R2-> C2.1,--,C3.1,--,10:03
3) Then for each key, process the multiple elements and convert them into one.