google-bigquery, apache-beam, dataflow

How can I filter duplicate TableRow data when writing to BigQuery so duplicate rows are removed?


I'm new to Dataflow, so forgive me if my question is naive. I have a CSV file with repeated rows. I am reading this data and writing it to BigQuery, but I don't want to write duplicate rows to my BQ table.

I have thought of one approach but don't know how to implement it: adding some sort of flag to the schema to mark a field as unique.

Lists.newArrayList(
    new TableFieldSchema()
        .setName("person_id")
        .setMode("NULLABLE")
        .setType("STRING"),
    new TableFieldSchema()
        .setName("person_name")
        .setMode("NULLABLE")
        .setType("STRING")); // Can't I add another property here to mark this field unique?

I don't know if that method will work, but all I need is to filter the rows produced by a transformation like:

PCollection<TableRow> peopleRows =
    pipeline
        .apply(
            "Convert to BigQuery Table Row",
            ParDo.of(new FormatForBigquery()));

    // Next step to filter duplicates

Solution

  • If we think of the output of your CSV read as a PCollection, we can eliminate duplicates by passing it through the Distinct transform. This transform takes an input PCollection and produces a new PCollection containing the original elements with the duplicates removed. The pre-made Distinct transform also lets you supply your own function that determines when two elements count as equal, and hence which ones to remove.
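Conceptually, deduplicating by a representative key (as Beam's `Distinct.withRepresentativeValueFn` does) means keeping one element per key. A minimal plain-Java sketch of that behavior, using a `LinkedHashMap` in place of a pipeline and hypothetical row maps standing in for `TableRow` objects:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DistinctSketch {
    // Keep the first element seen for each representative key,
    // mimicking a keyed Distinct over a bounded input.
    static <T, K> List<T> distinctBy(List<T> rows, Function<T, K> keyFn) {
        Map<K, T> seen = new LinkedHashMap<>();
        for (T row : rows) {
            seen.putIfAbsent(keyFn.apply(row), row);
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        // Hypothetical rows standing in for BigQuery TableRow objects.
        List<Map<String, String>> rows = List.of(
            Map.of("person_id", "1", "person_name", "Ada"),
            Map.of("person_id", "1", "person_name", "Ada"), // duplicate
            Map.of("person_id", "2", "person_name", "Grace"));

        List<Map<String, String>> unique =
            distinctBy(rows, r -> r.get("person_id"));
        System.out.println(unique.size()); // prints 2
    }
}
```

In the actual pipeline, this collapses to a single step after the formatting transform, e.g. `peopleRows.apply("Filter duplicates", Distinct.create())`, with `Distinct` imported from `org.apache.beam.sdk.transforms`.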