I'm new to Dataflow, so forgive me if my question is naive. I have a CSV file that I am reading, and it contains repeated rows. I am reading this data and writing it to BigQuery, but I don't want to write duplicate data to my BQ table.

I have thought of one approach, but I don't know how to implement it: adding some sort of flag to the schema to mark a field as unique.
Lists.newArrayList(
new TableFieldSchema()
.setName("person_id")
.setMode("NULLABLE").setType("STRING"),
new TableFieldSchema()
.setName("person_name")
.setMode("NULLABLE")
.setType("STRING") // Can't I add another property here to mark this field unique?
)
I don't know if that approach will work, but all I really need is to filter out the duplicate rows produced by a transformation like:
PCollection<TableRow> peopleRows =
    pipeline.apply(
        "Convert to BigQuery Table Row",
        ParDo.of(new FormatForBigquery()));
// Next step: filter out duplicates
If we think of the output of your CSV read as a PCollection, then we can eliminate duplicates by passing that PCollection through the Distinct transform. This transform takes the input PCollection and generates a new PCollection containing the same elements with the duplicates removed. As part of the pre-made Distinct transform, there is also the opportunity to supply your own function that determines which value represents an element for comparison purposes, and hence which elements count as duplicates to remove.
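As a sketch of how this could look in your pipeline: a plain `Distinct.create()` can run into trouble here because grouping `TableRow` elements relies on their coder being deterministic, which `TableRowJsonCoder` is not guaranteed to be. A common workaround is `Distinct.withRepresentativeValueFn`, which deduplicates by a key you extract yourself. The snippet below assumes `person_id` uniquely identifies a row; if two rows can share a `person_id` but differ in other fields, you would instead build a representative string from all the fields you care about.

```java
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import com.google.api.services.bigquery.model.TableRow;

// Deduplicate by person_id (assumed here to be the unique key).
// Rows whose person_id has already been seen are dropped; one
// representative row per key survives.
PCollection<TableRow> distinctRows =
    peopleRows.apply(
        "Remove duplicates",
        Distinct.withRepresentativeValueFn(
                (TableRow row) -> (String) row.get("person_id"))
            .withRepresentativeType(TypeDescriptors.strings()));
```

You would then feed `distinctRows` (rather than `peopleRows`) into your `BigQueryIO.writeTableRows()` step. Note that Distinct deduplicates only within the data processed by this pipeline run; it does not check against rows already present in the BigQuery table.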