Apache NiFi - Getting unique records from CSV files


I have two CSV files, and some records appear in both. I want to remove the duplicates and keep only the unique records. How can I do this with Apache NiFi?

Thank you!

input1.csv:

id,surname,name
1,ali,veli
2,mert,tolga

input2.csv:

id,surname,name
1,ali,veli
3,ahmet,ozan

output.csv:

id,surname,name
1,ali,veli
2,mert,tolga
3,ahmet,ozan

Solution

  • You can do this with record-based processing: use MergeRecord to merge the two CSV files into one flow file, then use the QueryRecord processor to deduplicate with a query like:

    SELECT * FROM FLOWFILE
    INTERSECT
    SELECT * FROM FLOWFILE
    

    Note that SELECT DISTINCT FROM FLOWFILE will not work. See the Calcite SQL reference: https://calcite.apache.org/docs/reference.html

    So you would need:

    • CSVReader controller service, with ignore header set to true
    • CSVRecordSetWriter controller service
    • MergeRecord
    • QueryRecord

    The output of QueryRecord will be the deduplicated CSV file.
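
To see what the merge-then-query steps above do to the sample data, here is a minimal Python sketch of the same logic outside NiFi. It is only an illustration under stated assumptions: SQLite's in-memory engine stands in for the Calcite SQL used by QueryRecord, the table name flowfile mimics FLOWFILE, and the function names read_records and merge_and_deduplicate are hypothetical helpers, not part of any NiFi API.

```python
import csv
import io
import sqlite3

# Sample data copied from the question.
INPUT1 = "id,surname,name\n1,ali,veli\n2,mert,tolga\n"
INPUT2 = "id,surname,name\n1,ali,veli\n3,ahmet,ozan\n"


def read_records(text):
    """Parse CSV text into (header, rows). Hypothetical helper, not a NiFi API."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [tuple(row) for row in reader]


def merge_and_deduplicate(*csv_texts):
    """Concatenate the records of several CSVs, then drop duplicate rows."""
    header = None
    merged = []
    for text in csv_texts:
        file_header, rows = read_records(text)
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("all inputs must share the same header")
        merged.extend(rows)  # the merge step: one combined record set

    # Assumption: SQLite stands in for the SQL engine behind QueryRecord.
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    conn.execute(f"CREATE TABLE flowfile ({columns})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO flowfile VALUES ({placeholders})", merged)

    # INTERSECT has set semantics, so intersecting the table with itself
    # returns each distinct row exactly once.
    rows = conn.execute(
        "SELECT * FROM flowfile INTERSECT SELECT * FROM flowfile"
    ).fetchall()
    conn.close()

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(sorted(rows))  # sort so the output order is deterministic
    return out.getvalue()


if __name__ == "__main__":
    print(merge_and_deduplicate(INPUT1, INPUT2), end="")
```

The row with id 1 appears in both inputs but only once in the result. The deduplication comes from INTERSECT treating rows as a set, which is why the query in the answer intersects FLOWFILE with itself.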
