hadoop hbase etl apache-nifi

Sync database extraction with Hadoop


Let's say you have a periodic task that extracts data from a database and loads it into Hadoop.

How does Apache Sqoop/NiFi keep the source database (SQL or NoSQL) in sync with the destination storage (Hadoop HDFS or HBase, or even S3)?

For example, say that at time A the database has 500 records, and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently detects the difference between time A and time B, so that only the changed rows are updated and the missing rows are added?


Solution

  • Yes, NiFi has the QueryDatabaseTable processor, which stores state and incrementally fetches only the records that were added or updated.

    If your table has a date column that is updated whenever a record changes, set that column in the Maximum-value Columns property. The processor will then pull only the changes made since the last stored state value.

    This article covers the QueryDatabaseTable processor in detail: https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html
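To make the idea concrete, here is a minimal Python sketch of the max-value-column technique described above. It is not NiFi's implementation, only an illustration of the same pattern: remember the highest value seen in a "last updated" column, and on the next run fetch only rows above it. The table `records`, the column `updated_at`, and the helpers `fetch_changes` and `apply_changes` are all hypothetical names, and an in-memory SQLite database and a plain dict stand in for the real source and destination.

```python
import sqlite3


def fetch_changes(conn, last_seen):
    """Return rows changed since last_seen and the new high-water mark.

    Assumes a hypothetical table `records` whose `updated_at` column is
    bumped on every insert or update.
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM records "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Keep the old mark if nothing changed, otherwise take the largest value.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


def apply_changes(target, rows):
    """Upsert fetched rows into the destination (a dict keyed by id here)."""
    for row_id, name, updated_at in rows:
        target[row_id] = (name, updated_at)


# Source database at "time A": two records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 100)]
)

target = {}
rows, mark = fetch_changes(conn, 0)  # first run: full load
apply_changes(target, rows)
first_batch = len(rows)

# "Time B": one record updated, one record added.
conn.execute("UPDATE records SET name = 'a2', updated_at = 200 WHERE id = 1")
conn.execute("INSERT INTO records VALUES (3, 'c', 200)")

rows, mark = fetch_changes(conn, mark)  # second run: only the delta
apply_changes(target, rows)
second_batch = len(rows)
```

In this sketch the second run returns only the updated row and the new row, and the untouched row is never re-read. Note that this pattern cannot see deleted rows, and it relies on the tracking column being updated on every change.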