apache-nifi

Pull Data from Hive to SQL Server without duplicates using Apache NiFi


Sorry, I'm new to Apache NiFi. I built a data flow that pulls data from Hive and stores it in SQL Server. The flow runs without errors; the only problem is that it pulls the same data repeatedly.

My data flow consists of the following processors:

  1. SelectHiveQL
  2. SplitAvro
  3. ConvertAvroToJSON
  4. ConvertJSONToSQL
  5. PutSQL

For example, my Hive table has only 20 rows, but when I run the data flow and check the table in MS SQL Server, it contains 5,000 rows. SelectHiveQL pulled the data repeatedly.

What do I need to do so that it pulls only 20 rows, that is, exactly the number of rows in my Hive table?

Thank you


Solution

  • SelectHiveQL (like many NiFi processors) runs on a user-specified schedule, so for as long as it is started it re-executes the query on every trigger and pulls the same rows again. To get a processor to run only once, you can set the Run Schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt the execution in progress; it just prevents the processor from being scheduled again.

    Another way might be to set the Run Schedule to something very large, so that the processor executes only once per very long interval (days, years, etc.).
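
To see why a 20-row table can turn into thousands of rows, here is a rough Python sketch of a timer-driven source processor. It is a toy model, not NiFi code: the function name `rows_written`, the 1-second schedule, and the 249-second running time are assumptions picked only to reproduce the 5,000 rows from the question. It also assumes the first trigger fires immediately on start and that every trigger re-runs the same query over the whole table.

```python
# Toy model of a timer-driven source processor. This is NOT NiFi code;
# the names and numbers below are illustrative assumptions.

HIVE_ROWS = 20  # rows in the Hive table, taken from the question


def rows_written(run_schedule_sec: int, running_for_sec: int) -> int:
    """Rows that reach the target table while the processor stays started.

    Assumes one trigger at t=0, then one trigger every run_schedule_sec,
    and that each trigger selects all HIVE_ROWS rows again.
    """
    triggers = 1 + running_for_sec // run_schedule_sec
    return triggers * HIVE_ROWS


# Left running for 249 s on a hypothetical 1 s schedule: 250 pulls of 20 rows.
duplicated = rows_written(run_schedule_sec=1, running_for_sec=249)

# Started on a 30 s schedule and stopped after 5 s: a single pull.
single_pull = rows_written(run_schedule_sec=30, running_for_sec=5)

print(duplicated, single_pull)
```

Under these assumptions the row count depends only on how many times the processor is triggered, which is why starting and quickly stopping it with a long Run Schedule yields exactly one copy of the table.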